Geometric Representation Learning

Abstract

Modern foundation models excel at semantics yet remain surprisingly fragile under the geometric transformations that govern the physical world, including rotations, scale changes, viewpoint shifts, and 3D rigid motions. The open question is how to bake these symmetry priors directly into representations rather than hoping data augmentation will recover them. Our research designs group-equivariant and self-supervised representation learning frameworks in which transformations of the input induce predictable, structured transformations of the feature space. We study invariance, equivariance, spherical and SO(3) harmonics, and symmetry-aware self-supervision as a single, unified toolkit. The downstream payoff is models that generalise from less data, behave consistently across viewpoints, and provide a geometrically grounded substrate for 3D perception, robotics, and multimodal AI.

Research Idea

Geometry is not a feature we tack on at the end of a pipeline; it is the language the world is written in. When a chair rotates, the pixels move in a structured way; when a camera shifts, every scene point transforms by the same rigid motion. We are interested in deep representations that respect these structures by construction, so that what the network learns about one viewpoint transfers automatically to all the others.

Concretely, we study four intertwined questions that the figure above tries to capture in one picture: when should a representation be invariant, when should it be equivariant, how do we supervise these properties without expensive labels, and how do we scale them up into modern transformer-based foundation models? If you find any of the threads below interesting, we would love to hear from you.

Invariance, Equivariance, and Symmetry as Inductive Bias

A good representation knows what to throw away and what to preserve. We treat invariance and equivariance as two halves of the same design principle: invariant features should ignore irrelevant transformations (a cat is still a cat upside down), while equivariant features should track the transformation so the geometry of the input survives in feature space. We are interested in group-equivariant convolutions and steerable architectures that hard-wire symmetries (rotation, reflection, scale, SE(2), SE(3)) into the layers themselves, as well as more flexible relaxations that learn approximate symmetries from data.

Recent work has shown that even very large transformers can be retrofitted with these priors, and we explore where the right trade-off lies between strict equivariance, soft equivariance, and architecture-agnostic symmetrisation.

Selected references

Spherical, SO(3), and Harmonic Representations for 3D

3D understanding is where geometric priors stop being a nice-to-have and start being essential. Pose, orientation, and viewpoint live on the rotation group SO(3), and naive Euclidean feature maps simply cannot represent them faithfully. We work with spherical CNNs, Wigner-D harmonic features, and SE(3)-equivariant transformers that operate in the frequency domain on the sphere, the same harmonic decomposition sketched on the right of the figure.

This direction connects directly to problems we care about: keypoint orientation, viewpoint estimation, point-cloud registration, and 3D pose regression. We are particularly interested in scaling these models to the size and data regimes of modern 3D foundation models while keeping their equivariance guarantees intact.

Selected references

Self-Supervised Consistency and Generalisation

Strict equivariance is one way to get a structured feature space; learned consistency is the other. The left side of the figure (augmentation, contrastive learning, cycle consistency, predictive coding) is our second lever. We study self-supervised objectives that ask the network to produce features which transform predictably under known image and 3D operations, so the geometry emerges from the loss rather than the layer.

We are interested in how far this can be pushed: can a model trained only with symmetry-aware self-supervision match the data efficiency and robustness of architectures with built-in equivariance? And can the same recipe scale to large multimodal and 3D foundation models, giving us the robust transfer the bottom of the figure promises? These are open questions, and we welcome students who like to sit at the interface of theory, computer vision, and large-scale representation learning.

Selected references

Equivariant Transformers and Steerable Foundation Models

The next frontier, and the upper-right corner of the figure, is asking whether the symmetry priors that worked beautifully for small group-equivariant CNNs can scale to the size, data, and architectures of modern foundation models. Transformers were not designed with rotation or SE(3) symmetry in mind, yet they now dominate large-scale 3D modelling, molecular simulation, and multimodal perception. The result is a fast-moving line of work on steerable and equivariant attention, where higher-degree spherical harmonics, Wigner-D representations, and Clebsch-Gordan tensor products are folded directly into the transformer block.

We are interested in two practical questions. First, how to design equivariant transformers that retain provable symmetry guarantees while matching the throughput and expressivity of plain ViTs. Second, how to use these architectures as foundation backbones for downstream geometric tasks (3D perception, robot manipulation, molecular property prediction) so that equivariance is inherited rather than re-engineered for each task.

Selected references

Jongmin Lee
Jongmin Lee
Assistant Professor of Computer Science Engineering

My research focuses on computer vision and machine learning, with interests in visual geometry, 3D vision, and spatial reasoning with multi-modal LLMs. I explore applications in autonomous systems, AR/VR, robotics, and physical AI.