Computer Vision & Visual Geometry

Image credit: [ChatGPT]

Abstract: My research in computer vision and visual geometry focuses on building robust methods to understand and reconstruct the 3D world from visual data. This includes tackling fundamental challenges such as visual correspondence, object and camera pose estimation, multi-view geometry, structure-from-motion, optical flow, and 3D reconstruction. I am also interested in leveraging invariance and equivariance principles to design models that generalize across diverse viewpoints and object variations. Ultimately, my work aims to advance spatial reasoning capabilities, enabling reliable perception and interaction in complex real-world environments.

Research Idea

In this line of research, we aim to advance visual geometry and 3D vision methods that provide robust and interpretable understanding of the physical world. While deep learning has significantly improved perception tasks, current models often struggle with geometric consistency, such as maintaining alignment across multiple views or estimating precise 3D structures under occlusions and clutter.
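
The multi-view consistency mentioned above has a classical algebraic form: corresponding points in two calibrated views must satisfy the epipolar constraint x2^T E x1 = 0 with E = [t]x R. A minimal numeric sketch of that check (synthetic camera poses and a hand-picked 3D point, all values illustrative):

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, t):
    """Essential matrix E = [t]x R relating two calibrated views."""
    return skew(t) @ R

def epipolar_residual(E, x1, x2):
    """|x2^T E x1| for normalized homogeneous points; 0 for a perfect match."""
    return abs(x2 @ E @ x1)

# Two views: identity pose vs. a camera translated along x.
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
E = essential_matrix(R, t)

# One 3D point projected into both (normalized) cameras.
X = np.array([0.2, -0.1, 4.0])
x1 = X / X[2]              # view 1: P1 = [I | 0]
X2 = R @ X + t
x2 = X2 / X2[2]            # view 2: P2 = [R | t]

print(epipolar_residual(E, x1, x2))   # ~0: the match is geometrically consistent
```

An outlier correspondence gives a large residual, which is exactly the signal robust estimators (e.g. RANSAC) threshold on.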

Relevant Papers by JM (but not limited to)

  • 3D Equivariant Pose Regression via Direct Wigner-D Harmonics Prediction, Jongmin Lee, Minsu Cho, NeurIPS 2024

  • Learning Rotation-Equivariant Features for Visual Correspondence, Jongmin Lee, Byungjin Kim, Seungwook Kim, Minsu Cho, CVPR 2023

  • Self-Supervised Equivariant Learning for Oriented Keypoint Detection, Jongmin Lee, Byungjin Kim, Minsu Cho, CVPR 2022

  • Self-supervised Learning of Image Scale and Orientation Estimation, Jongmin Lee, Yoonwoo Jeong, Minsu Cho, BMVC 2021

  • Learning to Distill Convolutional Features Into Compact Local Descriptors, Jongmin Lee, Yoonwoo Jeong, Seungwook Kim, Juhong Min, Minsu Cho, WACV 2021

  • Learning to Compose Hypercolumns for Semantic Visual Correspondence, Juhong Min, Jongmin Lee, Jean Ponce, Minsu Cho, ECCV 2020

  • SPair-71k: A Large-scale Benchmark for Semantic Correspondence, Juhong Min, Jongmin Lee, Jean Ponce, Minsu Cho, arXiv 2019

  • Hyperpixel Flow: Semantic Correspondence with Multi-layer Neural Features, Juhong Min, Jongmin Lee, Jean Ponce, Minsu Cho, ICCV 2019

  • Idea: Depth-induced SO(3)-equivariant pose estimation
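
As a sketch of the geometric hook behind the depth-induced idea above: a depth map plus intrinsics back-projects to a camera-frame point cloud X = d * K^{-1} [u, v, 1]^T, and that cloud transforms by exactly R under a camera rotation, making it a natural input for an SO(3)-equivariant pose head. A minimal unprojection check (toy intrinsics and a constant depth map, all values illustrative):

```python
import numpy as np

def unproject(depth, K):
    """Back-project a depth map to camera-frame points: X = d * K^{-1} [u, v, 1]^T."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], -1).reshape(-1, 3).astype(float)
    return (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])          # toy pinhole intrinsics
depth = np.full((4, 4), 2.0)             # constant 2 m depth map

pts = unproject(depth, K)

# Re-projecting the cloud with K recovers the original pixel grid.
proj = pts @ K.T
uv = proj[:, :2] / proj[:, 2:3]
print(uv[:3])
```

Rotating the camera maps this cloud to pts @ R.T, so any pose head consuming it can in principle be made equivariant to that action.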

Relevant Papers about Geometric Reasoning

  • Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models (Wang et al., CVPR 2025 highlight; Johns Hopkins)
  • Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model (Liu et al., CVPR 2025; UW)
  • Bridging Viewpoint Gaps: Geometric Reasoning Boosts Semantic Correspondence (Qian et al., CVPR 2025; UC Berkeley)

Vision tasks with Multi-modal LLMs (benchmarks)

Image keypoint matching with Multi-modal LLMs

  • Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
  • Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching

Semantic matching with Stable diffusion

  • Tang et al., “Emergent Correspondence from Image Diffusion,” NeurIPS 2023.
  • Hedlin et al., “Unsupervised Semantic Correspondence Using Stable Diffusion,” NeurIPS 2023.
  • Zhang et al., “A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence,” NeurIPS 2023.
  • Luo et al., “Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence,” NeurIPS 2023.

Pose estimation with Multi-modal LLMs

  • CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models, Kim et al., ICCV 2025

Orientation Estimation from a Single Image, with Uncertainty Estimation

  • U-ARE-ME: Uncertainty-Aware Rotation Estimation in Manhattan Environments

    • Rethinking Inductive Biases for Surface Normal Estimation
  • Idea: extend the work using equivariance, or perhaps diffusion? Surface normal .. single-view camera pose estimation
    • Digital IMU

Orientation Estimation with Foundation Models

  • ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation
  • Vision Foundation Model Enables Generalizable Object Pose Estimation
  • Equivariant IMU Preintegration with Biases: A Galilean Group Approach

Extreme rotation estimation baselines

  • Extreme Rotation Estimation in the Wild (Bezalel et al., CVPR 2025)
  • Extreme Rotation Estimation using Dense Correlation Volumes (Cai et al., CVPR 2021)
  • Idea: add SO(3)-equivariant modules, e.g., spherical CNNs ..
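
One way to see why spherical CNNs suit extreme rotation estimation: on the circle (the SO(2) analogue of the sphere), rotation is a cyclic shift, Fourier coefficients pick up only a phase, and the relative rotation between two signals falls out of that phase in closed form — the 1D counterpart of correlating spherical-harmonic (Wigner-D) coefficients on SO(3). A toy sketch, not drawn from any of the papers above:

```python
import numpy as np

N = 360
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)
signal = np.exp(np.cos(theta))            # arbitrary smooth signal on the circle

shift = 50                                # rotate by 50 bins (= 50 degrees)
rotated = np.roll(signal, shift)

# A cyclic shift multiplies Fourier coefficient k by exp(-2*pi*i*k*shift/N),
# so the relative rotation can be read off the phase of the first harmonic.
c1 = np.fft.fft(signal)[1]
c1_rot = np.fft.fft(rotated)[1]
alpha = np.angle(c1 / c1_rot)             # recovered rotation, radians
deg = int(round(float(np.degrees(alpha)))) % 360
print(deg)                                # 50
```

On SO(3) the phase factor becomes a Wigner-D matrix acting on each frequency block, but the estimation-by-correlation idea is the same.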

SfM and monocular vision

  • MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion
  • GeoCalib: Learning Single-image Calibration with Geometric Optimization

Single-view pose estimation, and more…

  • Can Generative Video Models Help Pose Estimation? (Cai et al., CVPR 2025)

Extension of Equivariant Adaptation of Large Pretrained Models

  • Equivariant Adaptation of Large Pretrained Models (Mondal et al., NeurIPS 2023)
    • Idea: 3D equivariant foundation models (on CroCo, DUSt3R)
      • Attach canonicalization modules before and after for SO(2) equivariance? (plug-in module)
      • Or lift to SO(3) equivariance?
      • Additionally, attach layer-wise equivariant LoRA? (plug-and-play)
        • Leverage the learned equivariance for generalization and for handling corner cases
  • SO(3)-Equivariant Representation Learning in 2D Images (Granberry et al., NeurIPS 2023 Workshop)
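
The canonicalization idea above can be prototyped without touching the pretrained weights: an equivariant module predicts a frame, the input is mapped to canonical pose, the frozen network runs, and the output is mapped back, giving exact equivariance end to end. A toy SO(3) point-cloud sketch (both the frame construction and the stand-in "pretrained" net are hypothetical, not the Mondal et al. implementation):

```python
import numpy as np

def equivariant_frame(points):
    """Hypothetical canonicalization module: a rotation built by Gram-Schmidt
    from two equivariant vectors (centroid and a norm-weighted mean), so that
    equivariant_frame(points @ R.T) == R @ equivariant_frame(points)."""
    a = points.mean(axis=0)
    b = (points * (points ** 2).sum(axis=1, keepdims=True)).mean(axis=0)
    e1 = a / np.linalg.norm(a)
    b = b - (b @ e1) * e1
    e2 = b / np.linalg.norm(b)
    e3 = np.cross(e1, e2)
    return np.stack([e1, e2, e3], axis=1)    # columns form a rotation matrix

def pretrained_net(points):
    """Stand-in for a frozen, NON-equivariant pretrained model:
    any fixed function of the input works for this demo."""
    return np.tanh(points @ np.arange(1.0, 10.0).reshape(3, 3)).mean(axis=0)

def equivariant_wrapper(points):
    """Canonicalize -> frozen net -> map the prediction back."""
    F = equivariant_frame(points)
    canonical = points @ F                    # apply F^T to each point
    return F @ pretrained_net(canonical)

rng = np.random.default_rng(1)
pts = rng.normal(size=(50, 3)) + 1.0

# Random rotation via QR; sign flip ensures det(R) = +1.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))

lhs = equivariant_wrapper(pts @ R.T)          # rotate input first
rhs = R @ equivariant_wrapper(pts)            # rotate output after
print(np.allclose(lhs, rhs))                  # True: the wrapper is SO(3)-equivariant
```

The same wrapping could plausibly sit around a CroCo/DUSt3R-style backbone, with the layer-wise equivariant LoRA idea as a finer-grained alternative to a single input/output canonicalization.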
Jongmin Lee
Assistant Professor of Computer Science and Engineering

My research focuses on computer vision and machine learning, with interests in visual geometry, 3D vision, and spatial reasoning with multi-modal LLMs. I explore applications in autonomous systems, AR/VR, robotics, and physical AI.