There is a joke among robotics researchers that goes: language models know everything about the world except how to exist in it. A large language model trained on the entire internet knows that a coffee cup is a cylinder that can be grasped around its circumference, that it is fragile, that it typically sits on flat surfaces, that it becomes hot when filled with hot liquid. What it does not know — what no amount of text training can teach — is what it feels like to pick one up. The weight distribution, the compliance of the handle, the way it pivots unpredictably if you grasp it slightly off-center. That knowledge lives in the body, not in language.
This gap — between knowing about things and knowing how to interact with them — is what the field of embodied AI is trying to close. The premise is that truly general intelligence requires not just the ability to process and generate language but the ability to learn through physical interaction with a real, three-dimensional, unpredictable world. And the researchers pursuing this premise are starting to produce results that suggest the approach may be right.
Why the Body Matters for Intelligence
The case for embodiment in intelligence is both philosophical and practical. Philosophically, a significant current in cognitive science holds that human intelligence is not just brain computation but brain-body-world computation — that our concepts, our reasoning, and our language are all shaped by the experience of having a body that moves through space, manipulates objects, and interacts with other bodies. On this view, the disembodied intelligence of a language model is not a simplification of human intelligence but a fundamentally different kind of thing, with its own distinct capabilities and limitations.
Practically, the limitations of disembodied language models are visible in any task that requires physical common sense. Ask GPT-5 to explain how to change a tire and it will produce a clear, accurate, step-by-step explanation. Ask it to actually change the tire — to control a robot doing the task — and the gap between verbal knowledge and physical capability becomes immediately apparent. Grasping, manipulating, navigating — these require a different kind of knowledge than language models acquire from text.
The Approaches Being Pursued
Several distinct approaches to embodied AI are active in 2025 and 2026, representing different intuitions about the path to physical intelligence.
Simulation-to-real transfer — training AI systems in physics simulation environments and then deploying them in real robots — has become the dominant paradigm for robot learning at scale. Simulated environments allow training at speeds impossible in the physical world: a simulated robot can experience thousands of hours of interaction in the time it would take a physical robot to experience one. The challenge is the simulation-to-reality gap — real environments have physical properties that simulations approximate imperfectly, and models trained in simulation often fail in the real world on the exact edge cases that simulation did not accurately represent. Closing this gap is a major active research area.
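One widely used technique for narrowing that gap is domain randomization: rather than tuning the simulator to match reality exactly, training varies the simulated physics (friction, masses, sensor latency) across episodes, so the learned policy cannot overfit to any single, inevitably imperfect physics model. Below is a minimal sketch of the idea in Python; the `Simulator` and `policy` interfaces are hypothetical stand-ins, not the API of any particular framework.

```python
import random

def randomized_physics():
    """Sample plausible physical parameters for one training episode."""
    return {
        "friction":    random.uniform(0.5, 1.5),   # surface friction coefficient
        "object_mass": random.uniform(0.05, 2.0),  # kilograms
        "motor_gain":  random.uniform(0.8, 1.2),   # actuator strength multiplier
        "latency_ms":  random.uniform(0.0, 40.0),  # sensor-to-actuator delay
    }

def train(policy, simulator, episodes=100_000):
    """Train a policy under different randomized physics every episode.

    `policy` and `simulator` are hypothetical placeholders: any objects
    exposing reset/observe/step and act/update methods would fit.
    """
    for _ in range(episodes):
        # Re-randomize the world each episode so the policy must succeed
        # across the whole range of physics the real robot might meet.
        simulator.reset(physics=randomized_physics())
        obs, done = simulator.observe(), False
        while not done:
            action = policy.act(obs)
            obs, reward, done = simulator.step(action)
            policy.update(obs, reward)
```

A policy trained this way treats the real world as just one more draw from the randomized distribution, which is why the technique tends to transfer better than training against a single carefully tuned simulation.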
Foundation models for robotics — applying the approach that produced large language models to the problem of robotic control — are being developed by Google DeepMind, OpenAI, and a cohort of startups. Google DeepMind's RT-2, for example, was trained on a combination of internet text, images, and robot interaction data, and the resulting single model could generalize to novel manipulation tasks by drawing on its language and visual understanding. It did not just learn to perform its training tasks — it learned something like physical common sense that transferred to situations it had never seen.
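A concrete mechanism behind models like RT-2 is treating robot actions as just more tokens: each continuous action dimension is discretized into a fixed number of bins, so the same transformer that emits words can emit motor commands. The sketch below illustrates that discretization step; the 256-bin count follows the RT papers, but the action ranges and the 7-dimensional command layout are illustrative assumptions.

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as described in the RT papers

def action_to_tokens(action, low, high):
    """Quantize each continuous action dimension to an integer bin in [0, N_BINS-1]."""
    normalized = (np.asarray(action) - low) / (high - low)
    return np.clip((normalized * N_BINS).astype(int), 0, N_BINS - 1)

def tokens_to_action(tokens, low, high):
    """Invert the quantization, mapping bin indices back to bin centers."""
    return low + (np.asarray(tokens) + 0.5) / N_BINS * (high - low)

# Illustrative example: a 7-dimensional arm command
# (6 end-effector deltas plus a gripper value), each in [-1, 1].
low, high = np.full(7, -1.0), np.full(7, 1.0)
action = np.array([0.12, -0.40, 0.05, 0.0, 0.0, 0.30, 1.00])
tokens = action_to_tokens(action, low, high)     # -> [143, 76, 134, 128, 128, 166, 255]
recovered = tokens_to_action(tokens, low, high)  # close to `action`, within one bin width
```

Because the action vocabulary is folded into the language model's own token space, pretraining on internet text and images and fine-tuning on robot trajectories become a single, uniform prediction problem.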
Physical AI — Nvidia's term for AI systems designed specifically to operate in the physical world — has been the framing for Nvidia's investment in robotics infrastructure. The Jetson platform provides AI computing for robots and autonomous systems. Project GR00T is a foundation model project aimed at humanoid robots. Nvidia has positioned itself as the infrastructure provider for physical AI in the same way it became the infrastructure provider for language AI — an ambitious bet that the embodied AI wave will be as significant as the language AI wave.
The Humanoid Bet
The most attention-grabbing development in embodied AI has been the sudden proliferation of humanoid robots. Figure AI, Physical Intelligence, Agility Robotics, Boston Dynamics, Tesla (with Optimus), and Unitree are all developing humanoid robot platforms with varying levels of AI integration and varying degrees of commercial readiness. Investment in humanoid robots from 2023 to 2025 reached several billion dollars — a scale that reflects genuine belief that the technology is approaching a commercial inflection point.
The argument for humanoid form factor is that the world is built for humans — doors, stairs, workbenches, tools are all designed to be operated by human-shaped bodies. A robot that can operate in human environments without modifications to those environments is far more practically useful than a robot that requires a custom-designed environment. If you can build a humanoid robot with the physical capability and AI intelligence to do general-purpose manual labor, the addressable market is essentially everything humans do with their hands.
The gap between current humanoid robot capability and that vision remains large. The best current systems can perform specific manipulation tasks in structured environments with careful setup. General-purpose dexterous manipulation — picking up arbitrary objects in arbitrary configurations, using tools designed for human hands, recovering from unexpected physical perturbations — is still well beyond what any current system can do reliably. The researchers building these systems believe the gap will close within years, not decades. That belief is what is driving the investment. Whether it is correct is the question that will determine whether the current investment wave produces a new industry or a cautionary tale.
The Timeline Question
Unlike language AI, where progress has been dramatic and visible on annual timescales, embodied AI progresses more slowly, is harder to measure, and depends on hardware that is difficult and expensive to build and iterate on. The optimists in the field believe that combining large-scale simulation training with foundation model approaches will accelerate robot learning in the way that foundation models accelerated language AI — producing a step-change in capability rather than incremental improvement.
The skeptics point out that robotics has been five years away from general capability for the past thirty years, and that the genuine difficulty of physical interaction with the world — the infinite variety of objects, surfaces, lighting conditions, and unexpected perturbations — may not yield to the approach that worked for language. Text, at the end of the day, is a discrete medium with a finite vocabulary and statistical regularities that can be learned from large corpora. The physical world is continuous, high-dimensional, and unbounded in ways that may demand different methods.
Both views contain truth. The pace of progress in embodied AI over the next five years will be one of the most consequential and most watched developments in the field. Whether AI finally gets a body that works — one that can do the things only bodies can do — will determine a great deal about what artificial intelligence means for the physical world and for the human labor that currently operates within it.
