World models rewrite the rules of robotics: Why robots are starting to learn from videos on the internet

World models represent a fundamental turning point in robotics. Instead of engineers manually programming physics simulations, these neural networks learn to understand the physical world by observing millions of hours of videos — much like a child who learns that a ball will fall off a table without having to solve Newton's laws. According to analysis by AWS and Striker Venture Partners, this approach promises to overcome the biggest obstacle that has held robotics back so far: the lack of training data.

How robots learn about the world — from rules to observation

Just twenty years ago, natural language processing relied on hand-written grammar rules. Linguists spent decades codifying syntax, until language models like GPT showed that a machine can learn grammar on its own — simply by reading trillions of tokens from the internet. Robotics today stands exactly where NLP was in 2005.

Every physics simulation for robots is built on hand-coded physical laws: friction coefficients, collision dynamics, contact models. A robot trained in such a simulation works brilliantly in the environment it was created for. But move it into an unfamiliar kitchen and it fails, and the failure is structural rather than a matter of tuning. Hand-coded physics cannot be scaled to general physical intelligence, just as hand-written grammar could not lead to GPT-4.

Data: Why YouTube is changing the game

Language models had a massive advantage: an internet full of trillions of tokens, digital and free. GPT-4 was reportedly trained on approximately 13 trillion tokens. By comparison, the Open X-Embodiment dataset, which pooled data from 34 robotics labs around the world, contains only about a million robotic trajectories. Each one requires physical hardware, human supervision, and careful curation work.

A world model is a neural network that learns physical intuition through observation, much as a human does. Show it millions of hours of video (cooking tutorials, factory floors, traffic, construction sites) and it starts building an internal representation of how the world works. This implicit understanding can be more robust than hand-written models, because it can hold up even in situations no engineer predicted.

The key insight of the last two years: once a model has strong world knowledge, it needs only a small amount of robot-specific data (action knowledge). V-JEPA 2 from Meta, trained on over a million hours of internet video, achieved an 80% success rate in zero-shot object manipulation in labs it had never seen before. DeepMind's Dreamer 4 learned to collect diamonds in Minecraft, a task that requires 20,000 sequential decisions, purely from recorded data, without any interaction with the environment.

Five architectures, one open question

The term "world model" covers five fundamentally different approaches. They all agree that hand-coded simulations are not enough. On everything else, they diverge:

Video-generative models — NVIDIA Cosmos (14 billion parameters), Runway GWM-1, and DeepMind Genie 3 predict future video frames. Genie 3 runs at 24 fps and functions as a live, playable simulation. The downside? Generating every pixel is extremely expensive.

Latent models — DeepMind's Dreamer series (Danijar Hafner, Timothy Lillicrap) creates a simplified internal representation of the world instead of pixels. Dreamer V3 outperformed specialized methods on over 150 tasks without any tuning whatsoever.

JEPA (Joint Embedding Predictive Architecture) — Yann LeCun's architecture predicts abstract representations instead of raw pixels. LeCun believes in this approach so strongly that he founded AMI Labs in Paris, which raised 1.03 billion dollars at a valuation of 3.5 billion — before delivering its first product.

Native multimodal reasoning — The DIAMOND model achieved the highest score of all world models on the Atari 100k benchmark. This approach argues that physical understanding must be built from the ground up together with text, image, and sound — not later "bolted on" to a text model.

Diffusion world models — UniSim showed that a single diffusion model can simulate the interaction of both humans and robots with the physical world. This is the least tested but potentially the most universal approach.
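To make the cost argument for pixel prediction concrete, here is a rough sketch comparing how many numbers a model must output per simulated second when it predicts raw frames versus a compact latent state. Only the 24 fps rate comes from the text; the 720p RGB resolution and the 1,024-dimensional latent vector are assumptions chosen for illustration, and output volume is at best a crude proxy for real compute cost.

```python
# Illustrative only: the resolution and latent size are assumptions,
# not figures from the article. Only the 24 fps rate comes from the text.
FPS = 24                                  # frame rate quoted for Genie 3
WIDTH, HEIGHT, CHANNELS = 1280, 720, 3    # assumed 720p RGB frames
LATENT_DIM = 1024                         # assumed size of one latent state


def values_per_second(values_per_step: int, steps_per_second: int = FPS) -> int:
    """Number of scalar values a model must predict per simulated second."""
    return values_per_step * steps_per_second


pixel_rate = values_per_second(WIDTH * HEIGHT * CHANNELS)
latent_rate = values_per_second(LATENT_DIM)
ratio = pixel_rate / latent_rate

print(f"pixel-space:  {pixel_rate:,} values/s")
print(f"latent-space: {latent_rate:,} values/s")
print(f"ratio:        {ratio:,.0f}x")
```

Under these assumptions the pixel-space model emits a few thousand times more values per second than the latent one, which is the intuition behind the latent and JEPA approaches.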

Evidence from the last 18 months suggests the approaches are converging: an understanding of physics emerges in every architecture as it scales. The differences between them are blurring, and the decisive factor may not be architecture but the speed of the transition from research to production.

Economics: Capabilities are growing, costs aren't yet

The parameter counts of world models have grown roughly 7,000-fold in five years, from 2 million with PlaNet to 14 billion with Cosmos. Training Cosmos occupied 10,000 H100 GPUs for three months. At this scale, physical understanding begins to emerge as an unplanned side effect: OpenAI observed this with Sora (3D consistency, object permanence) and DeepMind with Genie 2.
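Taking the two parameter counts above at face value, the growth factor and the implied yearly rate can be checked with a few lines of arithmetic. The GPU-hour estimate additionally assumes that "three months" means a 90-day run at full utilization, which the source does not state.

```python
# Figures quoted in the text above.
PLANET_PARAMS = 2_000_000          # PlaNet
COSMOS_PARAMS = 14_000_000_000     # Cosmos
YEARS = 5

growth = COSMOS_PARAMS / PLANET_PARAMS
annual_factor = growth ** (1 / YEARS)   # compound yearly growth

# Training compute, assuming (hypothetically) 90 days at full utilization.
GPUS = 10_000
ASSUMED_DAYS = 90
gpu_hours = GPUS * ASSUMED_DAYS * 24

print(f"total growth: {growth:,.0f}x")
print(f"per year:     {annual_factor:.1f}x")
print(f"GPU-hours:    {gpu_hours:,}")
```

The ratio of the two quoted counts works out to 7,000, or a little under sixfold growth per year when compounded over five years.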

The fundamental problem, however, is the operational cost. While a text language model costs approximately 15 cents per 100 user hours, Sora costs about 468 dollars per user hour. Even the more efficient Odyssey requires dedicated H100s at 50 dollars per hour. The reason is structural — video must be generated continuously, frame by frame, in real time. You can't spread fifty users across a single GPU as you can with text models.
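The size of that gap is easier to see once both prices are put on the same per-hour basis. The sketch below uses only the figures quoted above; the variable and function names are illustrative.

```python
# Prices quoted in the text, in US dollars.
TEXT_LLM_COST_PER_100_HOURS = 0.15
SORA_COST_PER_HOUR = 468.0
ODYSSEY_COST_PER_HOUR = 50.0

# Normalize the text-model price to a single user hour.
text_cost_per_hour = TEXT_LLM_COST_PER_100_HOURS / 100


def cost_ratio(video_cost_per_hour: float, text_cost: float = text_cost_per_hour) -> float:
    """How many times more expensive one hour of video is than one hour of text."""
    return video_cost_per_hour / text_cost


print(f"text LLM: ${text_cost_per_hour:.4f} per user hour")
print(f"Sora:     {cost_ratio(SORA_COST_PER_HOUR):,.0f}x the text cost")
print(f"Odyssey:  {cost_ratio(ODYSSEY_COST_PER_HOUR):,.0f}x the text cost")
```

On these numbers, an hour of Sora costs on the order of 300,000 times as much as an hour of a text model, and even the dedicated-H100 figure for Odyssey is tens of thousands of times higher.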

The good news is that we've already seen a similar cost decline curve — LLM inference has become a thousand times cheaper in three years. Decart already reports a 400-fold reduction in video costs thanks to its own inference engine.
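As a back-of-the-envelope sketch, the LLM figure above can be turned into a yearly rate and applied to video. This assumes, purely for illustration, that video inference follows the same curve as LLM inference, which nothing in the source guarantees; the $1 per hour target is likewise a hypothetical threshold, not a figure from the article.

```python
import math

# Quoted in the text: LLM inference became 1000x cheaper in three years.
LLM_DROP = 1000
LLM_YEARS = 3

annual_factor = LLM_DROP ** (1 / LLM_YEARS)   # roughly 10x cheaper per year


def years_to_reach(current: float, target: float,
                   factor_per_year: float = annual_factor) -> float:
    """Years until `current` cost falls to `target` at a constant yearly factor."""
    return math.log(current / target) / math.log(factor_per_year)


# Hypothetical target: $1 per user hour, starting from the quoted $468.
print(f"yearly factor: {annual_factor:.1f}x")
print(f"years to $1/h: {years_to_reach(468.0, 1.0):.1f}")
```

If the same curve held, the quoted Sora cost would fall below a dollar per hour in under three years. The result is only as good as the assumption behind it.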

NVIDIA has assembled the most comprehensive vertically integrated stack in this space: Cosmos for training world models, Isaac Sim for physics simulation, GR00T for humanoid models, Omniverse for digital twins, and Jetson Thor for edge deployment. AWS provides the cloud layer — SageMaker for training, Inferentia for optimized inference, and IoT Greengrass for fleet management.

Three structural gaps that will determine the timeline

The data gap. Video captures how things look, but not how they feel to the touch. Tasks requiring touch, such as handling fragile materials or inserting components, cause a dramatic drop in model performance. The human hand has roughly 17,000 touch receptors. Most deployed robotic hands have no tactile sensors. A standardized tactile dataset at the required scale does not exist, not for lack of technology, but because the field has failed to coordinate.

The architectural gap. Most robotics companies today rely on imitation learning (learning by copying). When the GRASP lab at UPenn tested robots trained this way in genuinely new conditions, the success rate dropped to 16.7%. In contrast, every documented case of a robot that worked for 10 or more hours without human intervention used reinforcement learning with a world model in the background.

Temporal coherence. Video-generative world models simulate convincing physics over short time spans. The longer the simulation runs, the more errors accumulate: objects end up in the wrong places, reactions stop making sense. Genie 3 maintains coherence for only a few minutes. World Labs' Marble handles drift better, but at a higher cost. The main challenge is finding the balance between simulation fidelity and cost.
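The accumulation of errors can be illustrated with a toy example. The sketch below is not how any of the models named above work: it uses a hypothetical one-dimensional linear system in which the "learned" model is off by 1% per step, and shows how that small per-step error compounds when each prediction is fed back in as the next input.

```python
def rollout(x0: float, gain: float, steps: int) -> list[float]:
    """Roll a 1-D linear system forward, feeding each output back in."""
    xs = [x0]
    for _ in range(steps):
        xs.append(gain * xs[-1])
    return xs


TRUE_GAIN = 1.0    # hypothetical ground truth: the state stays constant
MODEL_GAIN = 1.01  # hypothetical learned model, 1% off on every step


def relative_error(steps: int) -> float:
    """Relative gap between model and ground truth after `steps` steps."""
    true = rollout(1.0, TRUE_GAIN, steps)[-1]
    pred = rollout(1.0, MODEL_GAIN, steps)[-1]
    return abs(pred - true) / abs(true)


# At 24 fps, 24 steps is one second and 1440 steps is one minute.
for n in (1, 24, 240, 1440):
    print(f"{n:>5} steps: relative error {relative_error(n):.3g}")
```

A 1% error on a single step grows geometrically with the horizon, which is why long rollouts are so much harder than short ones. Real video models are nonlinear and can partly correct themselves, so this only conveys the direction of the effect.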

What this means for the Czech Republic and Europe

The transition from hand-written simulations to learned world models has direct implications for European industry as well. The Czech Republic, as an industrial powerhouse with a growing labor shortage, is among the countries where robotization has enormous potential. World model research can significantly shorten the time needed to deploy robots into production — instead of months of programming for a specific workplace, a future robot could understand a new environment within minutes.

The European Union, moreover, is creating a regulatory framework through the AI Act that will also influence the deployment of physical AI. For European companies, it will be crucial to monitor whether world models remain open (NVIDIA, Meta, and Physical Intelligence all released open models in 2025) or whether key infrastructure closes behind corporate walls. The world model and physical AI ecosystem has already attracted over 3 billion dollars in venture capital, and companies manufacturing humanoid hardware have a combined valuation of over 50 billion dollars.

What is the difference between a world model and a classical physics simulation?

A classical simulation uses hand-programmed physics equations (friction, gravity, collisions). A world model learns physics on its own — by observing videos and interactions, similar to how a human does. The advantage is that a world model can handle situations that engineers didn't think of when programming the simulation. For example, V-JEPA 2 from Meta achieved an 80% success rate in object manipulation in labs it had never seen before.

When will we see robots controlled by world models in Czech factories?

The field is still largely in the research phase. The main obstacle is the high operational cost: generating video in real time can cost hundreds of dollars per hour. However, costs are expected to drop dramatically, much as they did for language models (where inference became 1000× cheaper in three years). The first commercial deployments in industry can be expected within three to five years, and the Czech Republic, with its new AI center in Ostrava, has a good starting position.

Do world models require special hardware?

For training, yes — NVIDIA's Cosmos used 10,000 H100 GPUs over three months. For the actual deployment (inference) in a robot, however, specialized chips like NVIDIA Jetson Thor are already being used, which are designed for edge operation, meaning directly in the robot. AWS offers Inferentia chips for optimized cloud inference. As with language models, a rapid decline in hardware requirements is expected.
