Multi-Modal Large Language Models

Abstract

Our research asks how large language models can be taught to perceive, reason, and decide across the same rich mix of modalities (images, audio and speech, video, and 3D scenes) that humans handle effortlessly every day. Despite rapid progress, today’s multi-modal models still hallucinate fine-grained details, reason inconsistently over long audiovisual contexts, and lack the geometric grounding needed for embodied tasks. We tackle these gaps along four interlocking threads, namely building open multi-modal foundation models, eliciting reliable chain-of-thought reasoning that uses both pixel and acoustic evidence, designing post-training recipes that couple SFT with preference optimization (DPO, GRPO), and injecting 3D and geometry-aware representations into the multi-modal stack. Together, we aim to develop general-purpose systems that parse a noisy conversation as patiently as a careful listener, watch a long video without losing the thread, and walk through a 3D scene as confidently as a robot. The downstream payoff spans assistive listening, scientific discovery, embodied agents, and any setting where seeing, hearing, and reasoning must work in lockstep.

Research Idea

We organize our work into four threads that build on one another, and we love when students move between them. First, we lay the groundwork with open multi-modal foundation models — covering images, audio, video, and 3D — that the academic community can actually train and study. On top of that base, we teach models to reason over what they see and hear with chains of thought that genuinely use sensory evidence. We then shape their behavior through post-training, blending supervised fine-tuning with preference and reinforcement-learning recipes. Finally, we ground everything in 3D, so our models can eventually see, hear, reason, and act in the physical world.

Multi-Modal Foundation Models

At the center of our research sits the multi-modal foundation model — a single system that ingests images, speech and ambient sound, music, video, and 3D scenes and turns them into language, structure, or action. We are interested in how to build these models openly and efficiently, so that academic labs can still meaningfully participate in the frontier rather than only consume it. That means rethinking data curation, joint visual and audio tokenization, dynamic resolution and sample rates, and the bridge between vision and audio encoders and language backbones.

Concretely, we explore compute-efficient pre-training recipes (sub-1K GPU-hour regimes), high-quality instruction data spanning both visual and acoustic tasks, and architectures that scale gracefully from single-image and short-clip inputs to multi-image, long-video, and minutes-long audio. We are equally interested in the evaluation side: where current open multi-modal LLMs systematically break (multilingual speech, environmental sound in reverberant scenes, music understanding, audiovisual alignment), and what new benchmarks would actually move the field. We welcome students who want to ship open models and the data that powers them.

Selected references

Visual Reasoning and Self-Correction

A foundation model is only as useful as the reasoning it can carry out on top of what it sees and hears. We are fascinated by sensory chain-of-thought — long, structured deliberation that genuinely uses pixel and acoustic evidence rather than papering over it with linguistic priors. Scientific figures, math diagrams, physical scenes, and noisy conversational audio all demand the same loop: look or listen, hypothesize, check, revise. The image at the top of this page deliberately shows that loop, including the moment the model says “Wait — potential failure of this reasoning step” and backtracks.

We explore how to elicit and verify this behavior: cold-start CoT data construction, process- versus outcome-level rewards, self-critique modules that catch the model’s own mistakes, and benchmarks that score how a model reasons, not just whether the final answer is right. We are especially interested in the open question of when visual reasoning genuinely transfers across domains.

Selected references

Post-Training: SFT, DPO, and GRPO

Pre-training gets a multi-modal model to know things; post-training gets it to behave. We work across the full post-training stack — supervised fine-tuning, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO) — to align VLMs with human preferences, factual grounding, and step-by-step correctness. Following the DeepSeek-R1 line of work, we are particularly drawn to rule-based and verifiable rewards that let RL training scale without expensive human labeling.

We are exploring how preference and reasoning objectives interact when the model has to attend to pixels and audio rather than text alone: how to design verifiable multi-modal rewards (e.g., grounding IoU, transcription word error rate, audio-event match), how to curate preference data that is actually informative for perception, and how to balance helpfulness with hard-won reasoning gains. Students with a taste for reinforcement learning and a stomach for messy training curves will feel at home here.

Selected references

3D and Geometry-Aware Multi-Modal Understanding

Most VLMs treat the world as a flat grid of patches. But the scenes our models eventually have to act in — homes, labs, factories, streets — are emphatically three-dimensional. We are excited about pushing multi-modal LLMs from 2D pattern matching toward genuine 3D understanding: metric depth, object-level scene graphs, viewpoint consistency, and language that respects geometry.

We explore how to inject 3D inductive biases into otherwise standard VLM pipelines (depth plugins, 3D scene-graph supervision, point-cloud tokenizers), how to scale 3D vision-language data beyond the handful of curated indoor datasets, and how a 3D-grounded VLM can serve as the perception layer for embodied agents and robotics. This is the thread where our broader interest in spatial intelligence and our multi-modal LLM work converge.

Selected references

Jongmin Lee
Jongmin Lee
Assistant Professor of Computer Science Engineering

My research focuses on computer vision and machine learning, with interests in visual geometry, 3D vision, and spatial reasoning with multi-modal LLMs. I explore applications in autonomous systems, AR/VR, robotics, and physical AI.