Sergey Levine: The Future Of Robotic Foundation Models

📺 Today’s recommended deep-dive video: https://www.youtube.com/watch?v=48pxVdmkMIE

The Robotics Flywheel: Sergey Levine on the Path to General-Purpose Automation

Robotics is shifting from a niche laboratory science into an industrial-scale engineering challenge. Sergey Levine, co-founder of Physical Intelligence, explains how the same “foundation model” principles that revolutionized language are finally being applied to the physical world.

Core Question: How soon will general-purpose robotic models achieve human-level dexterity and autonomy in unpredictable environments?

Highlights

General-purpose robotic models are now able to perform diverse tasks like folding laundry and assembling boxes using a single, unified architecture.
The “robotics flywheel” is expected to start within 1–2 years, with significant autonomous home and industrial capabilities arriving in roughly five years.
Modern robotics benefits from “common sense” inherited from pre-trained Vision-Language Models, allowing robots to understand context that was previously impossible.
Hardware costs are plummeting, with research-grade robot arms dropping from $400,000 to $3,000 as intelligent software compensates for mechanical imprecision.

⏱️ Reading time: approx. 8 minutes · Saves you about 80 minutes vs. watching.

The Shift to General-Purpose Robotic Brains

Beyond the Science Experiment

For decades, robotics research was treated as a series of isolated science experiments in which each robot was hard-coded for a specific, narrow task. Physical Intelligence is changing this paradigm by building robotic foundation models designed to control any hardware for any purpose. Levine notes that robots can already handle the “basics” of dexterous manipulation, such as folding laundry or cleaning a kitchen, but these skills are merely the building blocks for much longer-horizon autonomy.

The ultimate goal isn’t just a robot that folds a T-shirt; it is a system that can manage a household for six months without intervention.

This requires a fundamental shift in how we think about robotic intelligence. We are moving away from the era of “scripted” movements and into the era of continuous learning. Once these systems are deployed in the real world, they can leverage a “flywheel” effect in which every interaction provides more data to improve future performance. Levine estimates that this self-sustaining loop could start turning within one to two years and accelerate from there.

[Figure: process map of the “Robotics Flywheel”. (1) Model Deployment leads to (2) Real-world Interaction, which generates (3) Experience Data (successes and failures), which is fed into (4) Automated and Human Labeling, resulting in (5) Model Fine-tuning, which loops back to improved Deployment.]
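The five stages of the flywheel can be sketched as a simple loop. This is a toy illustration only, not Physical Intelligence's pipeline: the scalar `skill`, the functions `run_episode`, `label`, and `fine_tune`, and the update rule are all hypothetical, and the assumption that more labeled data always improves the model is built in by hand.

```python
import random


def run_episode(skill, rng):
    """Stages 1-2 (hypothetical): deploy and interact; succeed with probability `skill`."""
    return {"success": rng.random() < skill}


def label(episode):
    """Stage 4 (hypothetical): stand-in for automated or human labeling."""
    episode["label"] = "good" if episode["success"] else "needs_correction"
    return episode


def fine_tune(skill, dataset, rate=0.001):
    """Stage 5 (toy rule): a larger dataset closes a larger share of the gap to 1.0."""
    return skill + (1.0 - skill) * min(1.0, rate * len(dataset))


def flywheel(rounds=5, episodes_per_round=100, skill=0.5, seed=0):
    rng = random.Random(seed)
    dataset = []          # stage 3: accumulated experience data
    history = [skill]     # skill after each round, starting value first
    for _ in range(rounds):
        batch = [label(run_episode(skill, rng)) for _ in range(episodes_per_round)]
        dataset.extend(batch)
        skill = fine_tune(skill, dataset)
        history.append(skill)
    return history, len(dataset)


if __name__ == "__main__":
    history, n_episodes = flywheel()
    print(n_episodes, [round(s, 3) for s in history])
```

Because the dataset keeps growing, each round's update in this toy is larger than the last, which is the “acceleration” the flywheel metaphor refers to. Whether real systems improve this smoothly is an open question the sketch does not answer.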

💡 Digging Deeper

Q: Why is 2025 different from the early days of self-driving cars in 2009?
A: Perception and “common sense,” the main obstacles in 2009, are now largely handled by LLMs and VLMs. In 2009, a robot couldn’t understand a “slippery floor” sign; today’s models can infer the physical consequences of such a sign without ever having fallen.

Q: Is manipulation harder than driving?
A: In some ways, yes, because of the dexterity required. However, manipulation is safer for learning because a robot can drop a dish and recover, whereas a car cannot “recover” from a high-speed collision to learn from its mistake.

Q: What is the primary bottleneck for scaling these models today?
A: It is no longer just the algorithms; it is the “axes of scale.” We need to identify exactly which data types—whether they are high-frequency motor actions or high-level linguistic instructions—contribute most to robustness and edge-case handling.

Architecture: Merging Language with Motor Control

The Action Expert and the VLM

The technical heart of this new wave of robotics is the adaptation of Vision-Language Models (VLMs) for motor control. Physical Intelligence’s current model, π0, builds on an open-weight backbone from Google’s Gemma family and grafts on what Levine calls a “motor cortex.” This “action expert” lets the model take in visual information and output continuous, high-frequency physical movements.

It is a single, end-to-end transformer that thinks in both text tokens and physical actions.

By using pre-trained weights from language models, these robots inherit a massive amount of “prior knowledge” about the world. They already know what a cup is, where a sink is usually found, and how gravity works. This allows the researchers to focus on the “bridge” between that abstract knowledge and the low-level motor commands required to move a gripper.

💡 Digging Deeper

Q: Why use diffusion and flow matching instead of discrete tokens for actions?
A: Physical actions are continuous and require extreme precision. Representing them as discrete “words” or tokens would lose the nuance needed for dexterous tasks like threading a needle or folding a box.

Q: Does the model need a long memory to function?
A: Surprisingly, no. Many dexterous tasks are “in the moment,” which fits Moravec’s Paradox: the hard part is low-level sensorimotor skill, not long-horizon reasoning. While long-term planning requires hours of context, the act of picking up a shirt only requires about one second of visual memory to be effective.
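The first answer above, on continuous versus discrete actions, can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions, not a description of how π0 represents actions: it assumes a single action dimension normalized to [-1, 1] and a naive tokenizer that splits that range into equal-width bins. The function names `quantize` and `worst_case_error` are hypothetical.

```python
def quantize(value, low, high, bins):
    """Snap a continuous value to the centre of one of `bins` equal-width buckets."""
    width = (high - low) / bins
    index = min(int((value - low) / width), bins - 1)  # clamp the top edge into the last bin
    return low + (index + 0.5) * width


def worst_case_error(low, high, bins):
    """Largest possible gap between a value and its bucket centre: half a bucket width."""
    return (high - low) / (2 * bins)


if __name__ == "__main__":
    action = 0.1234  # hypothetical normalized joint command
    for bins in (16, 256, 4096):
        token_value = quantize(action, -1.0, 1.0, bins)
        print(bins, abs(token_value - action), worst_case_error(-1.0, 1.0, bins))
```

The error shrinks as the vocabulary grows, so discretization is a trade-off between precision and vocabulary size, not a hard barrier. Diffusion and flow matching sidestep the trade-off by generating real-valued actions directly.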

The Economic and Hardware Explosion

The $3,000 Robot Arm

One of the most startling revelations is how quickly robotic hardware costs are falling. In 2014, a research robot cost about $400,000; today, Physical Intelligence uses arms that cost about $3,000, and that price is expected to continue falling. As the software becomes more “intelligent,” it demands less from the hardware. If a robot has good visual feedback, it doesn’t need expensive, ultra-precise joints, because the software can observe and correct for mechanical wobble in real time.

Cheap sensors and smart software are effectively “subsidizing” the cost of physical actuators.
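The idea that feedback can stand in for mechanical precision is the basic principle of closed-loop control, and a one-dimensional toy makes it visible. This is a minimal sketch under loud assumptions: the actuator model `move`, its `gain_error` of 0.8 (the joint travels only 80% of the commanded distance), and a camera that reports position exactly are all hypothetical.

```python
def move(position, command, gain_error=0.8):
    """Hypothetical cheap actuator: travels only a fraction of the commanded distance."""
    return position + gain_error * command


def open_loop(target):
    """Command the full distance once and trust the hardware."""
    return move(0.0, target)


def closed_loop(target, steps=10):
    """Repeatedly observe the remaining error (the 'camera') and command a correction."""
    position = 0.0
    for _ in range(steps):
        error = target - position      # assumed to be measured exactly by vision
        position = move(position, error)
    return position


if __name__ == "__main__":
    target = 1.0
    print("open-loop error:  ", abs(target - open_loop(target)))
    print("closed-loop error:", abs(target - closed_loop(target)))
```

With these numbers the open-loop arm stops 20% short, while each closed-loop correction removes 80% of whatever error remains, so the residual shrinks geometrically. Real arms add sensor noise, latency, and backlash, which this sketch ignores.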

This downward price pressure suggests a future where robots are as ubiquitous as smartphones. However, this creates a geopolitical challenge. Most of the supply chain for these arms, sensors, and actuators is currently centered in China. Levine argues that the US must invest in a “balanced ecosystem” that values both the AI “brain” and the hardware “body” to remain competitive.

Key Takeaways

The future of robotics is being built on the same foundations as ChatGPT, but with an added “motor cortex” that allows these models to interact with reality. We are moving away from the era of specialized robots toward general-purpose agents. This transition is being driven by the ability to leverage prior knowledge from the internet to solve physical problems that once required years of custom engineering.

The economic impact will be a massive boost in productivity, beginning with “human-in-the-loop” systems where robots augment human workers. Over the next five to ten years, as the data flywheel spins, we should expect a transition toward full automation in many blue-collar sectors. While the hardware supply chain remains a bottleneck, the plummeting cost of actuators and the rising intelligence of foundation models make a “robot in every home” a plausible reality by the 2030s.

Q&A

Q1: How does a robot learn “emergent” skills it wasn’t specifically trained for?
A1: Through compositional generalization. Because the models see a diverse range of behaviors, they learn to combine them in new ways—for example, a robot might learn to pick up a shopping bag that tipped over because it understands “picking up” and “uprightness” as separate, combinable concepts.

Q2: Will robots eventually learn from watching YouTube videos?
A2: Yes, but with a caveat. Watching a sport is not the same as playing it. Robots can use YouTube to understand what a task looks like, but they still need real-world “practice” to understand the tactile forces and timing involved.

Q3: Can we use simulation to train robots faster?
A3: Simulation is useful for “rehearsing” and considering counterfactuals, but it cannot inject new information about the messy, physical world. Real-world data remains the primary source of truth for the most difficult tasks.

Q4: What is Moravec’s Paradox in the context of AI?
A4: It is the observation that high-level reasoning (like chess or calculus) is comparatively easy for AI, while low-level sensorimotor skills (like walking or folding a napkin) are incredibly difficult.

Q5: Will the “robotic brain” be separate from the “knowledge brain”?
A5: Levine hopes they will merge. Understanding the physical world (e.g., “momentum”) provides the metaphors and grounding that humans use for abstract thought, and co-training the two likely makes both the robot and the LLM smarter.

Q6: What is the “minimum package” for a useful robot?
A6: We are still figuring that out, but it likely involves two grippers, a mobile base, and robust visual feedback. We don’t need a “mechanical person”; we need the functional minimum that gets the job done reliably.

Q7: What is the best buffer against automation for human workers?
A7: Education. Not just learning facts, but developing the flexibility to acquire new skills. As tasks are automated, the ability to adapt to new roles within a higher-productivity economy will be the most valuable asset.
