Application Research, AI+X

Abstract

Our application-oriented research asks how perception, geometry, and multi-modal reasoning can be transformed from offline benchmarks into systems that operate reliably in the physical world. We treat spatial intelligence as a shared core engine and pursue AI+X collaborations with domain experts across many verticals, including head-mounted displays, driving platforms, clinical workstations, civil and urban systems, and biology labs. Despite rapid progress on academic leaderboards, deployed settings still struggle with latency budgets, long-tail conditions, sensor heterogeneity, and safety constraints that benchmark accuracy alone cannot capture. Our group studies on-device spatial perception for AR/VR, planning-aware perception and occupancy world models for autonomous driving, trustworthy 3D segmentation for medical imaging, spatio-temporal and foundation-model forecasting for transportation and construction demand, and 3D vision for animal behavior, all sharing a backbone of multi-sensor fusion across camera, LiDAR, depth, and IMU. We treat robustness, calibration, and domain shift as first-class objectives rather than afterthoughts. The broader vision is a unified spatial-intelligence stack that lets the same geometric and semantic priors travel across consumer XR, mobility, healthcare, civil infrastructure, and the long tail of non-human subjects.

Research Idea

We position spatial intelligence as the core engine of our group and work in close partnership with domain experts on AI+X projects. Our role is to bring perception, geometry, and multi-modal reasoning; our collaborators bring the application context, the data, and the questions that matter in their fields, spanning AR/VR, autonomous driving, medical imaging, transportation and construction demand forecasting, and animal behavior. Each subsection below corresponds to one of those verticals, and we are actively looking for students and collaborators in all of them.

AR/VR: On-Device Spatial Perception and Neural Rendering

Mixed-reality headsets and smart glasses are becoming the next computing platform, and they demand a spatial stack that runs entirely on-device: millimeter-accurate tracking, photoreal rendering, eye-aware interfaces, and scene understanding that survives motion blur and aggressive viewpoint change. We explore how 3D Gaussian Splatting, neural fields, and structured scene representations can be compressed and accelerated for real-time SLAM and novel-view synthesis on power-budgeted hardware.

We are particularly interested in coupling eye tracking and gaze priors with geometric reconstruction, using where the user looks to drive foveated rendering, attention-conditioned scene parsing, and adaptive level of detail. Our long-term goal is a headset that reconstructs and understands a new room in seconds and behaves predictably the next time the user walks in.
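As a rough illustration of how gaze can drive adaptive level of detail, the sketch below maps the angle between the gaze ray and the ray to a scene point onto a discrete detail level. The function names and the eccentricity thresholds are hypothetical placeholders chosen for the example, not values or interfaces from our systems.

```python
import math


def eccentricity_deg(gaze_dir, point_dir):
    """Angle in degrees between the gaze ray and the ray to a scene point."""
    dot = sum(g * p for g, p in zip(gaze_dir, point_dir))
    norm = math.sqrt(sum(g * g for g in gaze_dir)) * math.sqrt(
        sum(p * p for p in point_dir)
    )
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def level_of_detail(ecc_deg, thresholds_deg=(5.0, 15.0, 30.0)):
    """Return 0 (finest) near the gaze point and coarser levels farther out.

    The thresholds are illustrative assumptions, not measured values.
    """
    for level, limit in enumerate(thresholds_deg):
        if ecc_deg <= limit:
            return level
    return len(thresholds_deg)


gaze = (0.0, 0.0, 1.0)  # user looks straight ahead along +z
for name, direction in [
    ("on-axis", (0.0, 0.0, 1.0)),
    ("near periphery", (0.2, 0.0, 1.0)),
    ("far periphery", (1.0, 0.0, 1.0)),
]:
    ecc = eccentricity_deg(gaze, direction)
    print(name, round(ecc, 1), level_of_detail(ecc))
```

A renderer could use the returned level to pick how many primitives to draw or which resolution tier to sample for that region.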

Autonomous Driving: Planning-Oriented Perception and Occupancy World Models

The center panel of the figure, the car surrounded by sensor cones and a BEV grid, captures a key open problem: how do we move past task-isolated perception toward representations that serve planning? We study end-to-end and modular driving stacks that share a single bird’s-eye-view or 3D-occupancy backbone across detection, mapping, motion forecasting, and trajectory planning, so that downstream uncertainty propagates cleanly back into upstream features.
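For readers unfamiliar with the bird's-eye-view grid mentioned above, here is a minimal sketch of the simplest version: rasterizing 3D points into a binary ground-plane occupancy grid. The function name, ranges, and cell size are assumptions made for the example; a learned BEV or occupancy backbone is far richer than this.

```python
import math


def bev_occupancy(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0), cell=1.0):
    """Mark each ground-plane cell that contains at least one (x, y, z) point.

    Rows index y (lateral) and columns index x (forward); height is ignored.
    """
    cols = int(round((x_range[1] - x_range[0]) / cell))
    rows = int(round((y_range[1] - y_range[0]) / cell))
    grid = [[0] * cols for _ in range(rows)]
    for x, y, _z in points:
        col = math.floor((x - x_range[0]) / cell)
        row = math.floor((y - y_range[0]) / cell)
        if 0 <= row < rows and 0 <= col < cols:  # drop points outside the grid
            grid[row][col] = 1
    return grid


# Hypothetical points in meters; the last one lies outside the grid.
points = [(0.5, -1.5, 0.2), (3.9, 1.9, 1.0), (10.0, 0.0, 0.0)]
for row in bev_occupancy(points):
    print(row)
```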

A second thread is occupancy-based world models: forecasting how the surrounding 3D scene evolves, not just where current objects are. We are also interested in vision-language interfaces for driving, including graph and chain-of-thought reasoning grounded in scene tokens, so that future driving stacks are auditable, queryable, and able to handle corner cases that never appeared in the training data.

Medical Imaging AI: Trustworthy 3D Segmentation and Foundation Models

The right panel of the figure, with CT, MRI, PET, and X-ray volumes, points to a domain where accuracy is not optional and where models must transfer across scanners, protocols, and patient populations. We study 3D segmentation and detection pipelines that build on the strongest open backbones (the nnU-Net family and SAM-style foundation models for medical images), and we focus on the parts of the problem that most often break in clinical practice: domain shift, calibration, federated training across hospitals, and segmentation of long-tail anatomy.

Our goal is to make these models honest: well calibrated, uncertainty-aware, and able to flag out-of-distribution scans rather than hallucinate a confident mask. We welcome collaboration with radiology and clinical partners who want to take vision research from the leaderboard to the reading room.
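One common way to quantify calibration is the binned expected calibration error (ECE), which compares a model's stated confidence with its observed accuracy. The sketch below is a generic illustration with a hypothetical function name and toy inputs; it is not a description of our evaluation protocol.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct: booleans, True where the prediction matched the label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 joins the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy example: a model that reports 0.9 confidence but is right half the time.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(overconfident, 3))
```

A well-calibrated model drives this number toward zero; a large value on a new scanner or protocol is one signal that confidence scores should not be trusted there.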

Sensor Fusion and Real-World Robustness

Cutting across all three pillars is the bottom row of the figure: camera + LiDAR + IMU fusion; robustness to latency, safety, and domain shift; and deployment in real-world systems. We treat sensor fusion as a unifying lens: the same geometric and uncertainty-aware machinery that aligns multi-camera rigs in a headset also aligns LiDAR and cameras on a vehicle, and the same domain-shift tools that protect a driving stack at night protect a segmentation model on a new scanner.
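The simplest instance of uncertainty-aware fusion is inverse-variance weighting of independent measurements of the same quantity. The sketch below assumes two already-aligned depth readings with known variances; the function name and the numbers are hypothetical, and real fusion stacks also have to handle calibration, time synchronization, and correlated errors.

```python
def fuse_inverse_variance(estimates):
    """Fuse independent (value, variance) estimates of one scalar quantity.

    Returns the minimum-variance linear combination and its variance.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    if any(var <= 0 for _, var in estimates):
        raise ValueError("variances must be positive")
    weight_sum = sum(1.0 / var for _, var in estimates)
    fused_value = sum(value / var for value, var in estimates) / weight_sum
    return fused_value, 1.0 / weight_sum


# Hypothetical depth readings in meters for one pixel: (value, variance).
lidar = (10.0, 0.01)
stereo = (10.4, 0.09)
depth, variance = fuse_inverse_variance([lidar, stereo])
print(round(depth, 3), round(variance, 4))
```

The fused estimate sits closer to the more certain sensor, and its variance is lower than either input, which is the behavior a fusion layer should preserve whether the sensors sit on a headset or a vehicle.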

If you are excited about pushing spatial AI all the way from a research notebook to a device a user actually wears, drives, or scans with, we would love to hear from you.

Transportation and Construction Demand Forecasting

A growing thread of our AI+X work happens with civil-engineering and urban-systems collaborators who study transportation networks, construction projects, and supply chains. Together we are exploring how spatial AI, spatio-temporal graph models, and time-series foundation models can be applied to demand forecasting at city scale, from passenger and freight flow on a road or rail network, to ridership at individual stations, to project demand and material or supply-chain volumes for construction. The lab does not yet have published work in this vertical, so we approach it as an open direction rather than a claim of prior expertise.

What makes this collaboration productive is access to real-world data through our partners (operator logs, sensor feeds, project pipelines, regional statistics) paired with our experience in building large foundation models that ingest heterogeneous, multimodal signals. We are interested in zero-shot and few-shot forecasting, in spatial graph structure that respects the underlying transportation or supply network, and in calibrated uncertainty for decisions that affect schedules, budgets, and infrastructure.
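As one example of what calibrated uncertainty can mean for a forecast, split conformal prediction turns held-out absolute errors into an interval around a point forecast. The sketch below uses a hypothetical function name and made-up ridership errors, and it assumes the calibration errors are exchangeable with future ones, which often needs care for time series.

```python
import math


def conformal_half_width(calibration_errors, alpha=0.1):
    """Split-conformal half-width from absolute errors on held-out data.

    Under exchangeability, forecast +/- half-width covers the truth with
    probability at least 1 - alpha.
    """
    n = len(calibration_errors)
    if n == 0:
        raise ValueError("need at least one calibration error")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(calibration_errors)[rank - 1]


# Hypothetical absolute errors (riders per hour) from a held-out week.
errors = [12, 5, 30, 8, 20, 15, 3, 25, 10, 18]
half_width = conformal_half_width(errors, alpha=0.1)
forecast = 400  # hypothetical point forecast for one station
print((forecast - half_width, forecast + half_width))
```

An interval like this gives a planner a range to schedule or budget against, rather than a single number with unknown reliability.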

Animal Behavior and Pose Estimation

A second AI+X direction takes the lab’s spatial-intelligence toolbox (equivariance, feed-forward 3D foundation models, multi-view geometry, sparse correspondence matching) and applies it to non-human subjects. We focus in particular on avian species, especially migratory shorebirds and waterfowl, endangered raptors, regional endemics across the East Asian–Australasian Flyway, and the long tail of bird diversity that conventional human-centric models almost never see. The work also extends to mammals and other taxa where the same geometric and category-aware ideas transfer. We partner with neuroscience, ecology, and wildlife collaborators in settings where labeled data is scarce, viewpoints are extreme (overhead camera traps, fisheye enclosures, telephoto wildlife footage), articulation is complex, and appearance varies widely between species. These are exactly the regimes in which generic human-centric pose and reconstruction models break, and where category-aware, equivariant, and few-shot methods earn their keep.

We are particularly excited about feed-forward animal reconstruction from casual video, family-aware shape and pose estimation across taxa, and 3D Gaussian Splatting-style models that can quantify pose and appearance for longitudinal behavior studies. This direction connects naturally with the broader CV4Animals workshop community, with whom we share the goal of making 3D vision tools usable by biologists, not just ML researchers.

Jongmin Lee
Assistant Professor of Computer Science Engineering

My research focuses on computer vision and machine learning, with interests in visual geometry, 3D vision, and spatial reasoning with multi-modal LLMs. I explore applications in autonomous systems, AR/VR, robotics, and physical AI.