Geometric Vision-Language Foundation Models

Example Research
In this line of research, we aim to integrate a geometric deep learning (GDL) module into an existing vision–language foundation model. The motivation is that while vision–language foundation models such as CLIP or multimodal LLMs are effective at capturing semantic associations between images and text, they often lack an explicit understanding of underlying geometric structure, such as spatial relations, 3D transformations, and multi-view consistency.
To address this gap, our approach introduces a GDL module that encodes representations which are equivariant or invariant to geometric transformations, including rotation, translation, and scaling, and injects them directly into the multimodal framework. By combining this module with the pre-trained backbone of the vision–language model, we aim to enrich the model’s capacity for spatial reasoning and 3D-aware perception.
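To make the integration concrete, the sketch below shows one way a lightweight geometric module could sit on top of a frozen vision–language backbone. It is a minimal illustration under simplifying assumptions: the names (GeometricAdapter, GeometryAwareVLM), the use of pairwise patch-center distances as a rotation- and translation-invariant code, and the assumption that the vision encoder returns per-patch tokens are all hypothetical; a full SE(3)-equivariant design would replace the adapter. This is not our finalized architecture.

```python
import torch
import torch.nn as nn


class GeometricAdapter(nn.Module):
    """Hypothetical GDL adapter: combines per-patch features with a geometric
    code built from pairwise patch-center distances, which are invariant to
    rotation and translation of the coordinate frame (a simplified stand-in
    for a full equivariant module)."""

    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.geom_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats:  (B, N, D) patch features from the frozen backbone
        # coords: (B, N, 2) patch-center coordinates
        dists = torch.cdist(coords, coords)            # (B, N, N), rot/trans-invariant
        geom = self.geom_mlp(dists.unsqueeze(-1))      # (B, N, N, D)
        geom = geom.mean(dim=2)                        # pool over neighbors -> (B, N, D)
        return self.fuse(torch.cat([feats, geom], dim=-1))


class GeometryAwareVLM(nn.Module):
    """Wraps a frozen vision-language backbone (e.g. a CLIP-like model) and
    adds the geometric adapter on the vision side; only the adapter trains."""

    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()
        self.text_encoder = text_encoder.eval()
        for p in self.vision_encoder.parameters():
            p.requires_grad_(False)
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        self.adapter = GeometricAdapter(feat_dim)

    def encode_image(self, images: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        feats = self.vision_encoder(images)            # assumed to return (B, N, D) patch tokens
        feats = self.adapter(feats, coords)
        return feats.mean(dim=1)                       # pooled image embedding

    def encode_text(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.text_encoder(tokens)
```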
For example, the integration enables the model not only to align images with textual descriptions but also to understand how objects and scenes transform across viewpoints, thereby supporting tasks such as:
- 3D object recognition
- Camera pose estimation
- Visual correspondence
- Robotics perception
Moreover, the GDL-enhanced representations preserve geometric consistency across modalities (see the training-objective sketch after this list), which can improve downstream performance in:
- AR/VR applications
- Embodied AI
- Scientific domains where geometric priors are critical
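As one way this geometric consistency could be enforced during training, the sketch below combines a standard CLIP-style contrastive image–text loss with a cross-view agreement term. The function name, the loss weighting, and the reliance on the GeometryAwareVLM sketch above are assumptions for illustration, not a prescribed training recipe.

```python
import torch
import torch.nn.functional as F


def geometric_consistency_loss(model, view_a, view_b, coords_a, coords_b, tokens):
    """Hypothetical objective: contrastive alignment between images and paired
    captions, plus a term pulling together embeddings of two views of the
    same scene so the learned space stays geometrically consistent."""
    img_a = F.normalize(model.encode_image(view_a, coords_a), dim=-1)
    img_b = F.normalize(model.encode_image(view_b, coords_b), dim=-1)
    txt = F.normalize(model.encode_text(tokens), dim=-1)

    # CLIP-style contrastive loss between view A and its paired captions.
    logits = img_a @ txt.t() / 0.07
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = F.cross_entropy(logits, targets)

    # Cross-view consistency: embeddings of the two views should agree.
    consistency = (1.0 - (img_a * img_b).sum(dim=-1)).mean()

    return contrastive + 0.5 * consistency
```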
Overall, this research direction represents a step toward building geometrically grounded multimodal foundation models that go beyond statistical correlation, incorporating mathematical structure and spatial intelligence into the foundations of vision–language learning.