
For the better part of a decade, the trajectory of artificial intelligence followed a simple formula: take a larger neural network, feed it more text, and watch performance improve. The scaling laws were remarkably predictable, until they were not. As building bigger and better chatbots becomes harder and more expensive while yielding diminishing returns, a growing contingent of AI researchers is pursuing a fundamentally different approach: teaching AI systems to learn by acting in simulated 3D worlds.
The shift, documented in a comprehensive feature in Science by Matthew Hutson, reflects a recognition that next-token prediction, no matter how many parameters a model has or how many trillions of tokens it is trained on, may never produce the kind of causal, embodied understanding that characterizes human-level intelligence.
“The notion that simply scaling an LLM will get to AGI is complete nonsense,” Yann LeCun, chief scientist at the newly formed AMI Labs, told Science. “It’s like saying you’re going to get into orbit by scaling airplanes.”
From words to worlds
The new paradigm is often called “world models”: neural networks that learn to simulate the physical world rather than simply process language. Unlike LLMs, which learn statistical patterns in text, world models learn causal relationships: that a cup falls when pushed off a table, that water flows downhill, that objects occlude one another.
Two sub-approaches have emerged. In offline world models, agents train by trial and error inside simulations, then transfer their skills to the real world. In online world models, agents carry an internal predictive model that lets them mentally simulate the consequences of actions before executing them: planning, reasoning, and correcting course in a way that looks far more like animal cognition than pattern matching.
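The "simulate before acting" idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and does not come from the systems described in the article: the one-dimensional point mass, the hypothetical `predict` function standing in for a learned model, the cost function, and the brute-force search over short action sequences. The agent imagines every three-step action sequence, scores the imagined outcome, and executes only the first action of the best one.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # decelerate, coast, accelerate (hypothetical action set)


def predict(state, action):
    """Hypothetical internal model: a 1-D point mass with state (position, velocity)."""
    pos, vel = state
    vel = vel + action
    return (pos + vel, vel)


def imagined_cost(state, actions, goal):
    """Roll the model forward through a candidate sequence; nothing is executed."""
    cost = 0
    for action in actions:
        state = predict(state, action)
        cost += abs(state[0] - goal)  # distance from the goal at each imagined step
    return cost + abs(state[1])       # prefer ending the rollout at rest


def plan(state, goal, horizon=3):
    """Imagine every action sequence of length `horizon`; return the best first action."""
    best = min(product(ACTIONS, repeat=horizon),
               key=lambda seq: imagined_cost(state, seq, goal))
    return best[0]


def run_episode(start=(0, 0), goal=4, steps=6):
    """Plan, act, re-plan. Here the 'real world' is assumed to match the model exactly."""
    state, taken = start, []
    for _ in range(steps):
        action = plan(state, goal)
        taken.append(action)
        state = predict(state, action)
    return state, taken


if __name__ == "__main__":
    print(run_episode())  # -> ((4, 0), [1, 1, -1, -1, 0, 0])
```

In this toy run the agent starts braking before it reaches the goal, because the imagined rollouts show that continued acceleration would overshoot. A real world model would be learned from data and imperfect, so the planned and actual outcomes would diverge and re-planning at every step would matter far more than it does here.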
“AI has development backwards,” said Brenden Lake of Princeton University. LLMs start with language instead of the embodied exploration that human infants use to learn about physics, causality, and object permanence. The result, Lake argues, is systems that are “so alien and so unhumanlike” that they cannot serve as a foundation for general intelligence.
The money is following
The shift is not merely theoretical. Major investments are flowing into world-model research:
Google DeepMind has developed Genie 3, a system that generates fully interactive, photorealistic 3D worlds in real time (20-24 fps at 720p) from text prompts or images. It models physics, water, lighting, and terrain, and can now integrate Google Maps data for realistic simulation. DeepMind’s SIMA 2 agent navigates and follows instructions in commercial video games it has never seen before, including Valheim, No Man’s Sky, and Goat Simulator 3, and can even operate in Genie 3-generated worlds it is encountering for the first time.
NVIDIA is pursuing world models for robotics through its GR00T platform, training humanoid robots inside the Isaac Sim physics simulation. The company’s Cosmos world foundation models generate synthetic training data, and its DreamZero system lets robots predict how the world will evolve after an action.
Yann LeCun’s AMI Labs, funded with $1.03 billion from NVIDIA, Samsung, and Bezos Expeditions, is building the LeWorldModel, a compact world model with only 15 million parameters (compared with hundreds of billions for frontier LLMs) that can train in a few hours on a single GPU. It has achieved a 96% success rate on the robotic Push-T benchmark, outperforming far larger systems. A formal proof published in May 2026 (arXiv) shows that LeCun’s LeJEPA architecture achieves linear identifiability: it can recover the true underlying causal variables (position, velocity, orientation) from raw pixels alone.
General Intuition, a startup founded by Adam Jelley, Pim de Witte, and Eloi Alonso, is training world models on more than 2 billion gameplay clips per year from Medal’s gaming platform (10 million monthly active users). The company raised $134 million in seed funding and is reportedly raising $300 million at a $2 billion valuation, with backing from Jeff Bezos, Eric Schmidt, and Vinod Khosla.
World Labs, founded by Fei-Fei Li, raised $1 billion from AMD, NVIDIA, Autodesk, and Fidelity for its “spatial intelligence” platform, Marble, which generates persistent, editable 3D environments from text, images, or video.
Why chatbots hit a wall
The scaling approach faces three fundamental constraints. First, the power law of scaling means each additional performance gain requires disproportionately more compute, data, and parameters, and the cost is now in the hundreds of billions of dollars. Second, high-quality public text data is approaching exhaustion; a 2024 study estimated the available stock will be depleted within a few years. Third, and most fundamentally, next-token prediction does not build causal models. An LLM can generate a plausible sentence about a ball falling off a table, but it has no internal representation of gravity, momentum, or object permanence. It cannot predict what would happen in a novel situation it has not seen in training text.
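The first constraint can be made concrete with a little arithmetic. The sketch below assumes loss follows a pure power law in compute, loss = a * compute^(-alpha). The exponent of 0.05 used in the usage note is an illustrative assumption, not a figure from the article, and the function name `compute_multiplier` is hypothetical.

```python
def compute_multiplier(loss_reduction, alpha):
    """Factor by which compute must grow to cut loss by `loss_reduction`
    (a fraction between 0 and 1), assuming loss = a * compute ** (-alpha).

    From (C2 / C1) ** (-alpha) = 1 - loss_reduction it follows that
    C2 / C1 = (1 / (1 - loss_reduction)) ** (1 / alpha).
    """
    if not 0.0 < loss_reduction < 1.0:
        raise ValueError("loss_reduction must be between 0 and 1")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return (1.0 / (1.0 - loss_reduction)) ** (1.0 / alpha)


if __name__ == "__main__":
    alpha = 0.05  # illustrative exponent only (an assumption)
    for cut in (0.1, 0.5):
        print(f"{cut:.0%} lower loss needs {compute_multiplier(cut, alpha):,.1f}x compute")
```

Under that assumed exponent, a 10% reduction in loss needs roughly 8 times the compute, and halving the loss needs 2^20, or about a million times the compute. The smaller the exponent, the steeper the bill for each further gain.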
“The smartest systems we have today are not as smart as a house cat,” LeCun said.
The open question
Not everyone is convinced that embodiment is necessary. Jared Kaplan of Anthropic, a co-author of the original 2020 scaling paper that defined the LLM era, told Science: “Some people have suggested that you can’t train AGI without embodiment, and I’m personally very skeptical of that.”
General Intuition’s de Witte frames the question in terms of cost: “Can LLMs develop implicit world models, or is explicit simulation necessary? The question is at what cost.”
The answer may determine not just the future of AI research, but the shape of the technology itself. If world models prove essential for robust reasoning, planning, and physical understanding, the years of scaling LLMs will look like a detour: a productive one, but a detour nonetheless. If Kaplan is right and implicit world models can emerge from sufficiently large language models trained on sufficiently diverse data, the detour may turn out to have been the most direct route after all.
Either way, the field is no longer betting exclusively on text.
Source:
[Science AAAS] Hutson M. “As better chatbots get harder to build, AI turns to simulated worlds.” Science, Vol. 392, Issue 6805, June 25, 2026. https://www.science.org/content/article/better-chatbots-get-harder-build-ai-turns-simulated-worlds

