Computer Vision & Visual Geometry

Image credit: [ChatGPT]

Abstract: My research in computer vision and visual geometry focuses on building robust methods to understand and reconstruct the 3D world from visual data. This includes tackling fundamental challenges such as visual correspondence, object and camera pose estimation, multi-view geometry, structure-from-motion, optical flow, and 3D reconstruction. I am also interested in leveraging invariance and equivariance principles to design models that generalize across diverse viewpoints and object variations. Ultimately, my work aims to advance spatial reasoning capabilities, enabling reliable perception and interaction in complex real-world environments.

Research Idea

In this line of research, we aim to advance visual geometry and 3D vision methods that provide robust and interpretable understanding of the physical world. While deep learning has significantly improved perception tasks, current models often struggle with geometric consistency, such as maintaining alignment across multiple views or estimating precise 3D structures under occlusions and clutter.
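
The multi-view consistency mentioned above has a classical algebraic form: corresponding points in two calibrated views must satisfy the epipolar constraint x2^T E x1 = 0 with E = [t]x R. A minimal numeric sketch of that check (synthetic camera poses and a hand-picked 3D point, all values illustrative):

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, t):
    """Essential matrix E = [t]x R relating two calibrated views."""
    return skew(t) @ R

def epipolar_residual(E, x1, x2):
    """|x2^T E x1| for normalized homogeneous points; 0 for a perfect match."""
    return abs(x2 @ E @ x1)

# Two views: identity pose vs. a camera translated along x.
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
E = essential_matrix(R, t)

# One 3D point projected into both (normalized) cameras.
X = np.array([0.2, -0.1, 4.0])
x1 = X / X[2]              # view 1: P1 = [I | 0]
X2 = R @ X + t
x2 = X2 / X2[2]            # view 2: P2 = [R | t]

print(epipolar_residual(E, x1, x2))   # ~0: the match is geometrically consistent
```

An outlier correspondence gives a large residual, which is exactly the signal robust estimators (e.g. RANSAC) threshold on.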

Relevant Papers by JM (but not limited to)

  • 3D Equivariant Pose Regression via Direct Wigner-D Harmonics Prediction, Jongmin Lee, Minsu Cho, NeurIPS 2024

  • Learning Rotation-Equivariant Features for Visual Correspondence, Jongmin Lee, Byungjin Kim, Seungwook Kim, Minsu Cho, CVPR 2023

  • Self-Supervised Equivariant Learning for Oriented Keypoint Detection, Jongmin Lee, Byungjin Kim, Minsu Cho, CVPR 2022

  • Self-supervised Learning of Image Scale and Orientation Estimation, Jongmin Lee, Yoonwoo Jeong, Minsu Cho, BMVC 2021

  • Learning to Distill Convolutional Features Into Compact Local Descriptors, Jongmin Lee, Yoonwoo Jeong, Seungwook Kim, Juhong Min, Minsu Cho, WACV 2021

  • Learning to Compose Hypercolumns for Semantic Visual Correspondence, Juhong Min, Jongmin Lee, Jean Ponce, Minsu Cho, ECCV 2020

  • SPair-71k: A Large-scale Benchmark for Semantic Correspondence, Juhong Min, Jongmin Lee, Jean Ponce, Minsu Cho, arXiv 2019

  • Hyperpixel Flow: Semantic Correspondence with Multi-layer Neural Features, Juhong Min, Jongmin Lee, Jean Ponce, Minsu Cho, ICCV 2019

  • Idea: Depth-induced SO(3)-equivariant pose estimation
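
As a sketch of the geometric hook behind the depth-induced idea above: a depth map plus intrinsics back-projects to a camera-frame point cloud X = d * K^{-1} [u, v, 1]^T, and that cloud transforms by exactly R under a camera rotation, making it a natural input for an SO(3)-equivariant pose head. A minimal unprojection check (toy intrinsics and a constant depth map, all values illustrative):

```python
import numpy as np

def unproject(depth, K):
    """Back-project a depth map to camera-frame points: X = d * K^{-1} [u, v, 1]^T."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], -1).reshape(-1, 3).astype(float)
    return (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])          # toy pinhole intrinsics
depth = np.full((4, 4), 2.0)             # constant 2 m depth map

pts = unproject(depth, K)

# Re-projecting the cloud with K recovers the original pixel grid.
proj = pts @ K.T
uv = proj[:, :2] / proj[:, 2:3]
print(uv[:3])
```

Rotating the camera maps this cloud to pts @ R.T, so any pose head consuming it can in principle be made equivariant to that action.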

Relevant Papers about Geometric Reasoning

  • Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models (Wang et al., CVPR 2025 highlight; Johns Hopkins)
  • Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model (Liu et al., CVPR 2025; UW)
  • Bridging Viewpoint Gaps: Geometric Reasoning Boosts Semantic Correspondence (Qian et al., CVPR 2025; UC Berkeley)

Vision tasks with Multi-modal LLMs (benchmarks)

Image keypoint matching with Multi-modal LLMs

  • Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
  • Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching

Semantic matching with Stable diffusion

  • Tang et al., “Emergent Correspondence from Image Diffusion,” NeurIPS 2023.
  • Hedlin et al., “Unsupervised Semantic Correspondence Using Stable Diffusion,” NeurIPS 2023.
  • Zhang et al., “A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence,” NeurIPS 2023.
  • Luo et al., “Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence,” NeurIPS 2023.

Pose estimation with Multi-modal LLMs

  • CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models, Kim et al., ICCV 2025

Orientation Estimation from a Single Image, with Uncertainty Estimation

  • U-ARE-ME: Uncertainty-Aware Rotation Estimation in Manhattan Environments

    • Rethinking Inductive Biases for Surface Normal Estimation
  • Idea: extend the work using equivariance, or perhaps diffusion? Surface normal .. single-view camera pose estimation
    • Digital IMU

Orientation Estimation with Foundation Models

  • ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation
  • Vision Foundation Model Enables Generalizable Object Pose Estimation
  • Equivariant IMU Preintegration with Biases: A Galilean Group Approach

Extreme rotation estimation baselines

  • Extreme Rotation Estimation in the Wild (Bezalel et al., CVPR 2025)
  • Extreme Rotation Estimation using Dense Correlation Volumes (Cai et al., CVPR 2021)
  • Idea: add SO(3)-equivariant modules, e.g., spherical CNNs ..
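
One way to see why spherical CNNs suit extreme rotation estimation: on the circle (the SO(2) analogue of the sphere), rotation is a cyclic shift, Fourier coefficients pick up only a phase, and the relative rotation between two signals falls out of that phase in closed form — the 1D counterpart of correlating spherical-harmonic (Wigner-D) coefficients on SO(3). A toy sketch, not drawn from any of the papers above:

```python
import numpy as np

N = 360
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)
signal = np.exp(np.cos(theta))            # arbitrary smooth signal on the circle

shift = 50                                # rotate by 50 bins (= 50 degrees)
rotated = np.roll(signal, shift)

# A cyclic shift multiplies Fourier coefficient k by exp(-2*pi*i*k*shift/N),
# so the relative rotation can be read off the phase of the first harmonic.
c1 = np.fft.fft(signal)[1]
c1_rot = np.fft.fft(rotated)[1]
alpha = np.angle(c1 / c1_rot)             # recovered rotation, radians
deg = int(round(float(np.degrees(alpha)))) % 360
print(deg)                                # 50
```

On SO(3) the phase factor becomes a Wigner-D matrix acting on each frequency block, but the estimation-by-correlation idea is the same.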

SfM and monocular vision

  • MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion
  • GeoCalib: Learning Single-image Calibration with Geometric Optimization

Single-view pose estimation, and more…

  • Can Generative Video Models Help Pose Estimation? (Cai et al., CVPR 2025)

Extension of Equivariant Adaptation of Large Pretrained Models

  • Equivariant Adaptation of Large Pretrained Models (Mondal et al., NeurIPS 2023)
    • Idea: 3D equivariant foundation models (on CroCo, DUSt3R)
      • Attach canonicalization modules before and after for SO(2) equivariance? (plug-in module)
      • Or lift to SO(3) equivariance?
      • Additionally, attach layer-wise equivariant LoRA? (plug-and-play)
        • Leverage the learned equivariance for generalization and for handling corner cases
  • SO(3)-Equivariant Representation Learning in 2D Images (Granberry et al., NeurIPS 2023 Workshop)
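
The canonicalization idea above can be prototyped without touching the pretrained weights: an equivariant module predicts a frame, the input is mapped to canonical pose, the frozen network runs, and the output is mapped back, giving exact equivariance end to end. A toy SO(3) point-cloud sketch (both the frame construction and the stand-in "pretrained" net are hypothetical, not the Mondal et al. implementation):

```python
import numpy as np

def equivariant_frame(points):
    """Hypothetical canonicalization module: a rotation built by Gram-Schmidt
    from two equivariant vectors (centroid and a norm-weighted mean), so that
    equivariant_frame(points @ R.T) == R @ equivariant_frame(points)."""
    a = points.mean(axis=0)
    b = (points * (points ** 2).sum(axis=1, keepdims=True)).mean(axis=0)
    e1 = a / np.linalg.norm(a)
    b = b - (b @ e1) * e1
    e2 = b / np.linalg.norm(b)
    e3 = np.cross(e1, e2)
    return np.stack([e1, e2, e3], axis=1)    # columns form a rotation matrix

def pretrained_net(points):
    """Stand-in for a frozen, NON-equivariant pretrained model:
    any fixed function of the input works for this demo."""
    return np.tanh(points @ np.arange(1.0, 10.0).reshape(3, 3)).mean(axis=0)

def equivariant_wrapper(points):
    """Canonicalize -> frozen net -> map the prediction back."""
    F = equivariant_frame(points)
    canonical = points @ F                    # apply F^T to each point
    return F @ pretrained_net(canonical)

rng = np.random.default_rng(1)
pts = rng.normal(size=(50, 3)) + 1.0

# Random rotation via QR; sign flip ensures det(R) = +1.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))

lhs = equivariant_wrapper(pts @ R.T)          # rotate input first
rhs = R @ equivariant_wrapper(pts)            # rotate output after
print(np.allclose(lhs, rhs))                  # True: the wrapper is SO(3)-equivariant
```

The same wrapping could plausibly sit around a CroCo/DUSt3R-style backbone, with the layer-wise equivariant LoRA idea as a finer-grained alternative to a single input/output canonicalization.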
Jongmin Lee
Assistant Professor of Computer Science and Engineering

My research focuses on computer vision and machine learning, with interests in visual geometry, 3D vision, and spatial reasoning with multi-modal LLMs. I explore applications in autonomous systems, AR/VR, robotics, and physical AI.