Geometric Vision-Language Foundation Models


Example Research

In this line of research, we aim to integrate a geometric deep learning (GDL) module into an existing vision–language foundation model. Vision–language foundation models (such as CLIP or multimodal LLMs) are powerful at capturing semantic associations between images and text, but they often lack an explicit understanding of the underlying geometric structure, such as spatial relations, 3D transformations, and multi-view consistency.

To address this gap, our approach introduces a GDL module that builds representations that are equivariant or invariant to geometric transformations, including rotation, translation, and scaling, directly into the multimodal framework. By combining this module with the pre-trained backbone of the vision–language model, we aim to enrich the model’s capacity for spatial reasoning and 3D-aware perception.
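
To make the two properties concrete: a feature map f is invariant to a transformation g if f(g·x) = f(x), and equivariant if f(g·x) = g·f(x). The following is only a toy sketch in plain Python on 2D point sets, not the GDL module described here. All function names (`centered`, `shape_descriptor`, and the helpers) are hypothetical and chosen for illustration.

```python
import math

def rotate(points, theta):
    """Rotate 2D points about the origin by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def translate(points, dx, dy):
    return [(x + dx, y + dy) for x, y in points]

def scale(points, k):
    return [(k * x, k * y) for x, y in points]

def centered(points):
    """Toy equivariant feature: coordinates relative to the centroid.

    Invariant to translation; equivariant to rotation and scaling.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]

def shape_descriptor(points):
    """Toy invariant feature: sorted pairwise distances divided by the largest.

    Unchanged by rotation, translation, and uniform scaling.
    Assumes at least two distinct points.
    """
    dists = sorted(
        math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]
    )
    longest = dists[-1]
    return [d / longest for d in dists]

def values_close(a, b, tol=1e-9):
    return len(a) == len(b) and all(
        math.isclose(u, v, abs_tol=tol) for u, v in zip(a, b)
    )

def points_close(a, b, tol=1e-9):
    return len(a) == len(b) and all(
        math.isclose(u, v, abs_tol=tol) for p, q in zip(a, b) for u, v in zip(p, q)
    )

if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
    moved = translate(scale(rotate(pts, 0.7), 2.5), 4.0, -1.0)

    # Invariance: the descriptor ignores rotation, scaling, and translation.
    print(values_close(shape_descriptor(moved), shape_descriptor(pts)))

    # Equivariance: rotating the input rotates the feature the same way.
    print(points_close(centered(rotate(pts, 0.7)), rotate(centered(pts), 0.7)))
```

A learned module would replace these hand-written features with trainable layers that satisfy the same constraints by construction; the checks in the usage section are the kind of sanity test one might run on such layers.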

For example, the integration is intended to let the model not only align images with textual descriptions but also understand how objects and scenes transform across viewpoints, supporting tasks such as:

  • 3D object recognition
  • Camera pose estimation
  • Visual correspondence
  • Robotics perception

Moreover, the GDL-enhanced representations preserve geometric consistency across modalities, which can improve downstream performance in:

  • AR/VR applications
  • Embodied AI
  • Scientific domains where geometric priors are critical

Overall, this research direction represents a step toward building geometrically grounded multimodal foundation models that go beyond statistical correlation, incorporating mathematical structure and spatial intelligence into the foundations of vision–language learning.

Jongmin Lee
Assistant Professor of Computer Science Engineering

My research focuses on computer vision and machine learning, with interests in visual geometry, 3D vision, and spatial reasoning with multi-modal LLMs. I explore applications in autonomous systems, AR/VR, robotics, and physical AI.