Geometric Vision-Language Foundation Models

Example Research
In this line of research, we aim to integrate a geometric deep learning (GDL) module into an existing vision–language foundation model. The motivation is that while vision–language foundation models such as CLIP or multimodal LLMs are effective at capturing semantic associations between images and text, they often lack an explicit understanding of underlying geometric structure, such as spatial relations, 3D transformations, and multi-view consistency.
To address this gap, our approach introduces a GDL module that encodes representations which are equivariant or invariant to geometric transformations, including rotation, translation, and scaling, and injects them directly into the multimodal framework. By combining this module with the pre-trained backbone of the vision–language model, we aim to enrich the model’s capacity for spatial reasoning and 3D-aware perception.
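To make the integration concrete, the sketch below shows one way a lightweight geometric module could sit on top of a frozen vision–language backbone. It is a minimal illustration under simplifying assumptions: the names (GeometricAdapter, GeometryAwareVLM), the use of pairwise patch-center distances as a rotation- and translation-invariant code, and the assumption that the vision encoder returns per-patch tokens are all hypothetical; a full SE(3)-equivariant design would replace the adapter. This is not our finalized architecture.

```python
import torch
import torch.nn as nn


class GeometricAdapter(nn.Module):
    """Hypothetical GDL adapter: combines per-patch features with a geometric
    code built from pairwise patch-center distances, which are invariant to
    rotation and translation of the coordinate frame (a simplified stand-in
    for a full equivariant module)."""

    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.geom_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats:  (B, N, D) patch features from the frozen backbone
        # coords: (B, N, 2) patch-center coordinates
        dists = torch.cdist(coords, coords)            # (B, N, N), rot/trans-invariant
        geom = self.geom_mlp(dists.unsqueeze(-1))      # (B, N, N, D)
        geom = geom.mean(dim=2)                        # pool over neighbors -> (B, N, D)
        return self.fuse(torch.cat([feats, geom], dim=-1))


class GeometryAwareVLM(nn.Module):
    """Wraps a frozen vision-language backbone (e.g. a CLIP-like model) and
    adds the geometric adapter on the vision side; only the adapter trains."""

    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()
        self.text_encoder = text_encoder.eval()
        for p in self.vision_encoder.parameters():
            p.requires_grad_(False)
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        self.adapter = GeometricAdapter(feat_dim)

    def encode_image(self, images: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        feats = self.vision_encoder(images)            # assumed to return (B, N, D) patch tokens
        feats = self.adapter(feats, coords)
        return feats.mean(dim=1)                       # pooled image embedding

    def encode_text(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.text_encoder(tokens)
```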
For example, the integration enables the model not only to align images with textual descriptions but also to understand how objects and scenes transform across viewpoints, thereby supporting tasks such as:
- 3D object recognition
- Camera pose estimation
- Visual correspondence
- Robotics perception
Moreover, the GDL-enhanced representations preserve geometric consistency across modalities (see the training-objective sketch after this list), which can improve downstream performance in:
- AR/VR applications
- Embodied AI
- Scientific domains where geometric priors are critical
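As one way this geometric consistency could be enforced during training, the sketch below combines a standard CLIP-style contrastive image–text loss with a cross-view agreement term. The function name, the loss weighting, and the reliance on the GeometryAwareVLM sketch above are assumptions for illustration, not a prescribed training recipe.

```python
import torch
import torch.nn.functional as F


def geometric_consistency_loss(model, view_a, view_b, coords_a, coords_b, tokens):
    """Hypothetical objective: contrastive alignment between images and paired
    captions, plus a term pulling together embeddings of two views of the
    same scene so the learned space stays geometrically consistent."""
    img_a = F.normalize(model.encode_image(view_a, coords_a), dim=-1)
    img_b = F.normalize(model.encode_image(view_b, coords_b), dim=-1)
    txt = F.normalize(model.encode_text(tokens), dim=-1)

    # CLIP-style contrastive loss between view A and its paired captions.
    logits = img_a @ txt.t() / 0.07
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = F.cross_entropy(logits, targets)

    # Cross-view consistency: embeddings of the two views should agree.
    consistency = (1.0 - (img_a * img_b).sum(dim=-1)).mean()

    return contrastive + 0.5 * consistency
```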
Overall, this research direction represents a step toward building geometrically grounded multimodal foundation models that go beyond statistical correlation, incorporating mathematical structure and spatial intelligence into the foundations of vision–language learning.