Robotics for Real-World Physical AI

Abstract

Our research asks how an autonomous agent can perceive, reason about, and act in the unstructured physical world with the same fluency that modern foundation models exhibit on the web. Despite rapid progress on tabletop benchmarks, today’s robotic systems still degrade sharply under viewpoint change, novel object geometry, and the long horizons that real households and factories demand. The data needed to bridge that gap is also orders of magnitude scarcer than vision or language corpora. We address this by unifying 3D geometric perception, language-conditioned policy learning, and large-scale simulation into a single stack, with explicit attention to equivariance, multimodal grounding, and closing the sim-to-real gap. The lab integrates the five pillars shown above (Perception, Planning, Control, Learning, and Vision-Language-Action) so that a single agent can map raw sensor streams to robust closed-loop behavior. Our broader aim is to turn embodied AI from a controlled demonstration into a dependable everyday capability across manipulation, mobile, and humanoid platforms.

Research Idea

A robot that is genuinely useful in the real world must see in 3D, understand what a person is asking, and translate that intent into safe, dexterous motion under uncertainty. We build that full stack, from depth-aware perception, through language-conditioned policies, to scalable training in simulation, and we study the interfaces between these layers, because that is where today’s systems still break. If you care about turning learning algorithms into embodied behavior, the questions below are the ones we are actively chasing with prospective students and collaborators.

3D Perception and Spatial Reasoning for Manipulation

A manipulator can only grasp what it can geometrically localize. We are interested in perception modules that recover dense 3D structure from RGB-D streams, fuse multi-view observations into a coherent scene, and pass that representation directly into a downstream policy rather than collapsing it to a 2D image. Our work emphasises equivariant and viewpoint-consistent representations so that an agent trained in one workspace continues to behave correctly when the camera, lighting, or object placement shifts.
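To make the RGB-D step concrete, here is a minimal sketch of lifting depth images to 3D and fusing several views into one world-frame point cloud. It assumes a pinhole camera model with known intrinsics and known camera-to-world poses; the function names (`backproject`, `to_world`, `fuse_views`) are illustrative and not part of any particular library or of our own codebase.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth image in metres to an (N, 3) camera-frame point cloud.

    Assumes a pinhole camera; pixels with depth <= 0 are treated as invalid.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: pixel column, v: pixel row
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def to_world(points_cam, T_world_cam):
    """Apply a 4x4 camera-to-world rigid transform to (N, 3) points."""
    R, t = T_world_cam[:3, :3], T_world_cam[:3, 3]
    return points_cam @ R.T + t

def fuse_views(views):
    """Concatenate clouds from (depth, (fx, fy, cx, cy), T_world_cam) tuples."""
    clouds = [to_world(backproject(d, *k), T) for d, k, T in views]
    return np.concatenate(clouds, axis=0)
```

A real pipeline would add filtering, voxel downsampling, and feature fusion on top of this; the point is only that the policy can consume the fused 3D structure rather than a single 2D image.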

We are particularly drawn to architectures where 3D scene tokens, point clouds, or neural fields become first-class inputs to a learned controller. This is the bridge between the “Perception” and “Control” pillars in the figure above, and we believe much of the remaining gap to robust manipulation lives precisely on that bridge.

Selected references

3D Diffuser Actor: Policy Diffusion with 3D Scene Representations (Ke et al., 2024) — https://arxiv.org/abs/2402.10885
SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation (Wang et al., ICLR 2024) — https://arxiv.org/abs/2310.16838
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (Chi et al., RSS 2023 / IJRR 2025) — https://arxiv.org/abs/2303.04137

Planning and Long-Horizon Decision Making

The “Planning” pillar in the figure above sits between high-level language intent and the low-level controller: a usable household or factory agent has to decompose “set the table” into an ordered sequence of grounded subgoals, monitor whether each step succeeded, and recover when it did not. We study how large language and vision-language models can act as planners that propose feasible action sketches, then how those sketches are verified against the robot’s actual affordances and 3D scene state rather than hallucinated capabilities.

We are especially interested in closed-loop planning, where the planner reads back perception updates and edits its own plan mid-execution, and in code-generating planners that emit verifiable programs over a fixed skill library. This is where the “Vision-Language-Action” and “Control” pillars meet in the figure, and it is the layer that turns a strong policy into a dependable, multi-step agent.
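The two ideas above, grounding a language model's proposals in affordances and monitoring each step during execution, can be sketched in a few lines. This is a toy illustration under stated assumptions: `select_skill` multiplies a language-relevance score by a feasibility estimate in the spirit of SayCan, and `run_plan` retries a failed subgoal before handing control back for replanning. The names, the dictionary-based world state, and the retry policy are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Skill = Callable[[dict], None]  # a skill mutates the (toy) world state in place

def select_skill(llm_scores: Dict[str, float],
                 affordances: Dict[str, float]) -> str:
    """Pick the skill maximising language relevance times estimated feasibility."""
    return max(llm_scores, key=lambda s: llm_scores[s] * affordances.get(s, 0.0))

def run_plan(plan: List[str], skills: Dict[str, Skill], state: dict,
             verify: Callable[[str, dict], bool],
             max_retries: int = 2) -> List[Tuple[str, bool, int]]:
    """Execute subgoals in order, checking each against perception.

    Returns one (step, succeeded, attempts) record per executed step and
    stops at the first step that still fails after all retries, so that a
    higher-level planner can replan from the current state.
    """
    log = []
    for step in plan:
        succeeded = False
        attempts = 0
        for _ in range(1 + max_retries):
            attempts += 1
            skills[step](state)        # act
            if verify(step, state):    # read back the perceived outcome
                succeeded = True
                break
        log.append((step, succeeded, attempts))
        if not succeeded:
            break
    return log
```

In a real system `verify` would query the perception stack and the plan itself would come from a language or code-generating model; here both are left as plain callables so the control flow stays visible.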

Selected references

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan) (Ahn et al., CoRL 2022) — https://arxiv.org/abs/2204.01691
Code as Policies: Language Model Programs for Embodied Control (Liang et al., ICRA 2023) — https://arxiv.org/abs/2209.07753
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models (Huang et al., CoRL 2023) — https://arxiv.org/abs/2307.05973
Inner Monologue: Embodied Reasoning through Planning with Language Models (Huang et al., CoRL 2022) — https://arxiv.org/abs/2207.05608
AutoRT: Embodied Foundation Models for Large-Scale Orchestration of Robotic Agents (DeepMind Robotics, 2024) — https://arxiv.org/abs/2401.12963

Vision-Language-Action Models and Generalist Policies

The language overlay in the figure (“Pick up the block”) represents a fundamental shift: instructions are no longer hard-coded skills but free-form natural language grounded in what the robot sees. We explore Vision-Language-Action (VLA) models that inherit semantic priors from internet-scale pretraining and then learn to emit robot actions, so that a single policy can be steered across tasks, embodiments, and scenes by language alone.
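One common way for a language-model backbone to "emit robot actions" is to discretise each continuous action dimension into a fixed number of bins and treat the bin indices as tokens; RT-2 and OpenVLA, for example, use 256 bins per dimension. The sketch below shows uniform binning and its inverse. The bounds, function names, and bin-centre decoding are assumptions for illustration, not any model's actual tokenizer.

```python
import numpy as np

N_BINS = 256  # bins per action dimension (assumed; matches the figure quoted above)

def encode_action(action, low, high, n_bins=N_BINS):
    """Map continuous actions in [low, high] to integer tokens in [0, n_bins - 1]."""
    action = np.clip(action, low, high)
    frac = (action - low) / (high - low)
    return np.minimum((frac * n_bins).astype(int), n_bins - 1)

def decode_action(tokens, low, high, n_bins=N_BINS):
    """Map tokens back to the centre of their bin."""
    return low + (tokens + 0.5) / n_bins * (high - low)

# Usage: a hypothetical 7-DoF end-effector action normalised to [-1, 1].
action = np.array([0.10, -0.25, 0.00, 0.90, -1.00, 1.00, 0.33])
tokens = encode_action(action, -1.0, 1.0)
recovered = decode_action(tokens, -1.0, 1.0)
```

The round-trip error is bounded by half a bin width, which is the resolution cost paid for reusing a token-prediction head. Continuous alternatives such as diffusion or flow-matching heads avoid that quantisation at the price of a different decoding procedure.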

We are interested in three open questions: how to make these models faster and cheaper to run on real hardware, how to fine-tune them with only a handful of demonstrations, and how to make their reasoning interpretable enough that an operator can trust them outside the lab.

Selected references

OpenVLA: An Open-Source Vision-Language-Action Model (Kim et al., CoRL 2024) — https://arxiv.org/abs/2406.09246
π0: A Vision-Language-Action Flow Model for General Robot Control (Physical Intelligence, 2024) — https://arxiv.org/abs/2410.24164
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Brohan et al., CoRL 2023) — https://arxiv.org/abs/2307.15818
Helix: A Vision-Language-Action Model for Generalist Humanoid Control (Figure, 2025) — https://www.figure.ai/news/helix

Learning Dexterous and Bimanual Manipulation

The arm in the image is performing a deceptively simple grasp; harder variants of that task, such as folding laundry, pouring liquid, or threading a cable, still defeat most learned policies. We work on imitation learning, diffusion-based action generation, and reinforcement fine-tuning for high-frequency, contact-rich control, with a focus on bimanual and eventually humanoid embodiments.

We are excited by policies that explicitly model action multimodality, that can recover from their own mistakes mid-trajectory, and that scale gracefully when more data or more arms are added. If you have ever wondered why a 7B-parameter model can write a sonnet but cannot reliably close a zip-lock bag, this is the chapter we want to write together.
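Why action multimodality matters can be shown with a toy example. Suppose demonstrators pass an obstacle either on the left (action near -1) or on the right (action near +1). A unimodal regressor trained with squared error converges to the mean of the demonstrations, which points straight at the obstacle, whereas a policy that samples from the demonstrated distribution commits to one mode. The data and the resampling stand-in below are hypothetical; no diffusion model is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations: steer left (-1) or right (+1) around an obstacle at 0.
modes = rng.choice([-1.0, 1.0], size=1000)
demos = modes + 0.05 * rng.standard_normal(1000)

# The squared-error-optimal constant prediction is the mean of the demonstrations.
mse_action = demos.mean()

# A generative policy samples from the demonstrated distribution instead;
# resampling the demonstrations stands in for a trained sampler.
sampled_action = rng.choice(demos)

def distance_to_nearest_mode(a):
    """Distance from an action to the closer of the two demonstrated modes."""
    return min(abs(a - 1.0), abs(a + 1.0))

print("regressed action:", mse_action, "sampled action:", sampled_action)
```

The regressed action lands near 0, far from both demonstrated behaviours, while the sampled action stays close to one of them. Diffusion-based and other generative policies are attractive precisely because they can represent this kind of distribution.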

Selected references

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation (Liu et al., ICLR 2025) — https://arxiv.org/abs/2410.07864
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots (NVIDIA, 2025) — https://arxiv.org/abs/2503.14734
ALOHA Unleashed: A Simple Recipe for Robot Dexterity (Zhao et al., CoRL 2024) — https://arxiv.org/abs/2410.13126

Self-Supervised World Models for Embodied Agents

If a robot could imagine the consequences of its own actions before executing them, much of the data problem in robotics would ease. That is the promise of self-supervised world models: train a network to predict what happens next in a learned latent space rather than at the pixel level, and the resulting representation already encodes objects, contact, and motion before a single reward signal is added. The JEPA family (I-JEPA on images, V-JEPA and V-JEPA 2 on video) is the clearest expression of this idea, and we find it compelling because latent prediction avoids spending modeling capacity on textures, lighting, and other details that per-frame pixel prediction must reproduce but the controller never needs.

We are especially interested in action-conditioned world models, where the network learns p(future | past, action) and a policy can be optimised by planning or by rolling out trajectories inside the model. An action-conditioned variant of V-JEPA 2, post-trained on unlabeled robot video, has been used to plan zero-shot pick-and-place on real Franka arms, which is a strong signal that this recipe transfers off the benchmark. In parallel, foundation-scale world models like DeepMind’s Genie and Genie 2, and NVIDIA’s Cosmos platform, are pushing toward generative, controllable environments that can be sampled from and steered with language or actions; we want to understand when these large simulators are the right substrate for policy learning versus the leaner latent predictors above.
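Planning inside an action-conditioned model can be sketched with the cross-entropy method (CEM): sample candidate action sequences, roll each one out through the predictor, keep the sequences whose predicted final latent is closest to a goal latent, and refit the sampling distribution. The dynamics function below is a toy stand-in for a learned predictor, and all names and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def cem_plan(dynamics, z0, z_goal, horizon=5, action_dim=2,
             pop=200, n_elite=20, iters=10, seed=0):
    """Search for an action sequence whose predicted final latent is
    closest (in L1 distance) to z_goal, using the cross-entropy method.

    `dynamics(z, a)` must accept batched latents (pop, latent_dim) and
    batched actions (pop, action_dim) and return the next latents.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        actions = mean + std * rng.standard_normal((pop, horizon, action_dim))
        z = np.tile(z0, (pop, 1))
        for t in range(horizon):
            z = dynamics(z, actions[:, t])          # roll out inside the model
        costs = np.abs(z - z_goal).sum(axis=1)      # distance to the goal latent
        elite = actions[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

def toy_dynamics(z, a):
    """Hypothetical predictor: each action nudges the latent state linearly."""
    return z + 0.1 * a

z0, z_goal = np.zeros(2), np.array([0.3, -0.2])
plan = cem_plan(toy_dynamics, z0, z_goal)

# Execute the planned sequence in the same model to check where it ends up.
z = z0.copy()
for a in plan:
    z = toy_dynamics(z, a)
final_error = np.abs(z - z_goal).sum()
```

In practice only the first action of the plan is executed before replanning from a fresh observation, which gives the receding-horizon, closed-loop behaviour that makes model errors tolerable.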

Open questions we are chasing with collaborators: how to keep the predictor honest about geometry and contact (not just visual plausibility), how to combine a JEPA-style encoder with a downstream VLA, and how much real-world interaction is actually needed once a strong video world model is in place.

Selected references

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (Assran et al., 2025) — https://arxiv.org/abs/2506.09985
Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA) (Bardes et al., 2024) — https://arxiv.org/abs/2404.08471
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA) (Assran et al., CVPR 2023) — https://arxiv.org/abs/2301.08243
DreamerV3: Mastering Diverse Domains through World Models (Hafner et al., Nature 2025) — https://arxiv.org/abs/2301.04104
Cosmos World Foundation Model Platform for Physical AI (NVIDIA, 2025) — https://arxiv.org/abs/2501.03575
Genie 2: A Large-Scale Foundation World Model (DeepMind, 2024) — https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/

Simulation and Sim-to-Real Transfer

Real-robot data is expensive and slow to collect, and policies trained on small amounts of it tend to be brittle. Alongside the latent world models above, we invest heavily in the other side of the loop: large-scale physics simulation and synthetic data generation from a handful of human demonstrations, so that an agent can rehearse behaviors before touching anything physical. The goal is a training pipeline where most experience comes from a simulator, the world model fills in the parts the simulator gets wrong, and only the final calibrating fraction comes from the real arm.

We are interested in faithful contact physics, domain randomization grounded in real sensor statistics, and policies that transfer zero-shot or with minimal real-world adaptation. This closes the loop with the perception and VLA work above; together they form the integrated pipeline depicted in the figure.
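Grounding domain randomization in real sensor statistics can be as simple as fitting noise parameters to a few calibration measurements and then sampling simulator settings around them. The sketch below assumes Gaussian depth noise; the parameter names and the friction and lighting ranges are hypothetical placeholders, not values from any specific simulator.

```python
import numpy as np

def fit_noise_stats(real_depth_errors):
    """Estimate mean and standard deviation of depth error (metres)
    from real calibration measurements."""
    errors = np.asarray(real_depth_errors, dtype=float)
    return errors.mean(), errors.std(ddof=1)

def sample_sim_params(noise_mean, noise_std, rng, spread=0.5):
    """Draw one randomised simulator configuration centred on the measured statistics."""
    return {
        "depth_bias_m": rng.normal(noise_mean, noise_std),
        "depth_noise_std_m": noise_std * rng.uniform(1 - spread, 1 + spread),
        "friction": rng.uniform(0.4, 1.2),         # hypothetical range
        "light_intensity": rng.uniform(0.5, 1.5),  # hypothetical range
    }

def corrupt_depth(depth, params, rng):
    """Apply the sampled bias and noise to a clean simulated depth image."""
    noise = rng.normal(params["depth_bias_m"], params["depth_noise_std_m"], depth.shape)
    return np.clip(depth + noise, 0.0, None)

# Usage: one randomised episode configuration.
rng = np.random.default_rng(0)
mean, std = fit_noise_stats([0.001, -0.002, 0.003, 0.000, -0.001])
params = sample_sim_params(mean, std, rng)
noisy = corrupt_depth(np.full((4, 4), 0.8), params, rng)
```

Sampling a fresh configuration per episode exposes the policy to a band of conditions around what the real sensor produces, rather than to arbitrary perturbations that may never occur on hardware.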

Selected references

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations (Mandlekar et al., CoRL 2023) — https://arxiv.org/abs/2310.17596
Open X-Embodiment: Robotic Learning Datasets and RT-X Models (Open X-Embodiment Collaboration, ICRA 2024) — https://arxiv.org/abs/2310.08864
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots (Nasiriany et al., RSS 2024) — https://arxiv.org/abs/2406.02523

Jongmin Lee

Assistant Professor of Computer Science Engineering