Geometric Representation Learning

Abstract

Modern foundation models excel at semantics yet remain surprisingly fragile under the geometric transformations that govern the physical world, including rotations, scale changes, viewpoint shifts, and 3D rigid motions. The open question is how to bake these symmetry priors directly into representations rather than hoping data augmentation will recover them. Our research designs group-equivariant and self-supervised representation learning frameworks in which transformations of the input induce predictable, structured transformations of the feature space. We study invariance, equivariance, spherical and SO(3) harmonics, and symmetry-aware self-supervision as a single, unified toolkit. The downstream payoff is models that generalise from less data, behave consistently across viewpoints, and provide a geometrically grounded substrate for 3D perception, robotics, and multimodal AI.

Research Idea

Geometry is not a feature we tack on at the end of a pipeline; it is the language the world is written in. When a chair rotates, the pixels move in a structured way; when a camera shifts, every scene point transforms by the same rigid motion. We are interested in deep representations that respect these structures by construction, so that what the network learns about one viewpoint transfers automatically to all the others.

Concretely, we study four intertwined questions that the figure above tries to capture in one picture: when should a representation be invariant, when should it be equivariant, how do we supervise these properties without expensive labels, and how do we scale them up into modern transformer-based foundation models? If you find any of the threads below interesting, we would love to hear from you.

Invariance, Equivariance, and Symmetry as Inductive Bias

A good representation knows what to throw away and what to preserve. We treat invariance and equivariance as two halves of the same design principle: invariant features should ignore irrelevant transformations (a cat is still a cat upside down), while equivariant features should track the transformation so the geometry of the input survives in feature space. We are interested in group-equivariant convolutions and steerable architectures that hard-wire symmetries (rotation, reflection, scale, SE(2), SE(3)) into the layers themselves, as well as more flexible relaxations that learn approximate symmetries from data.

Recent work has shown that even very large transformers can be retrofitted with these priors, and we explore where the right trade-off lies between strict equivariance, soft equivariance, and architecture-agnostic symmetrisation.

Selected references

Group Equivariant Convolutional Networks (Cohen & Welling, ICML 2016) — https://arxiv.org/abs/1602.07576
General E(2)-Equivariant Steerable CNNs (e2cnn) (Weiler & Cesa, NeurIPS 2019) — https://arxiv.org/abs/1911.08251
Learning Probabilistic Symmetrization for Architecture Agnostic Equivariance (Kim et al., NeurIPS 2023) — https://arxiv.org/abs/2306.02866
Equivariant Neural Networks for General Linear Symmetries on Lie Algebras (Kim et al., 2025) — https://arxiv.org/abs/2510.22984

Spherical, SO(3), and Harmonic Representations for 3D

3D understanding is where geometric priors stop being a nice-to-have and start being essential. Pose, orientation, and viewpoint live on the rotation group SO(3), and naive Euclidean feature maps simply cannot represent them faithfully. We work with spherical CNNs, Wigner-D harmonic features, and SE(3)-equivariant transformers that operate in the frequency domain on the sphere, the same harmonic decomposition sketched on the right of the figure.

This direction connects directly to problems we care about: keypoint orientation, viewpoint estimation, point-cloud registration, and 3D pose regression. We are particularly interested in scaling these models to the size and data regimes of modern 3D foundation models while keeping their equivariance guarantees intact.

Selected references

Spherical CNNs (Cohen et al., ICLR 2018) — https://arxiv.org/abs/1801.10130
Learning SO(3) Equivariant Representations with Spherical CNNs (Esteves et al., ECCV 2018) — https://arxiv.org/abs/1711.06721
Scalable and Equivariant Spherical CNNs by Discrete-Continuous (DISCO) Convolutions (Ocampo et al., ICLR 2023) — https://arxiv.org/abs/2209.13603
SE3ET: SE(3)-Equivariant Transformer for Low-Overlap Point Cloud Registration (2024) — https://arxiv.org/abs/2407.16823
EquAct: SE(3)-Equivariant Multi-Task Transformer for Robotic Manipulation (2025) — https://arxiv.org/abs/2505.21351

Self-Supervised Consistency and Generalisation

Strict equivariance is one way to get a structured feature space; learned consistency is the other. The left side of the figure (augmentation, contrastive learning, cycle consistency, predictive coding) is our second lever. We study self-supervised objectives that ask the network to produce features which transform predictably under known image and 3D operations, so the geometry emerges from the loss rather than the layer.

We are interested in how far this can be pushed: can a model trained only with symmetry-aware self-supervision match the data efficiency and robustness of architectures with built-in equivariance? And can the same recipe scale to large multimodal and 3D foundation models, giving us the robust transfer the bottom of the figure promises? These are open questions, and we welcome students who like to sit at the interface of theory, computer vision, and large-scale representation learning.

Selected references

S-TREK: Sequential Translation and Rotation Equivariant Keypoints for Local Feature Extraction (Santellani et al., ICCV 2023) — https://arxiv.org/abs/2308.14598
Understanding the Role of Equivariance in Self-supervised Learning (Wang et al., NeurIPS 2024) — https://arxiv.org/abs/2411.06508
Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps (Mariotti et al., CVPR 2024) — https://arxiv.org/abs/2312.13216
Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching (Kim et al., ICLR 2023) — https://arxiv.org/abs/2303.14969
Axis-level Symmetry Detection with Group-equivariant Representation (Yu et al., ICCV 2025) — https://arxiv.org/abs/2508.10204

Equivariant Transformers and Steerable Foundation Models

The next frontier, and the upper-right corner of the figure, is asking whether the symmetry priors that worked beautifully for small group-equivariant CNNs can scale to the size, data, and architectures of modern foundation models. Transformers were not designed with rotation or SE(3) symmetry in mind, yet they now dominate large-scale 3D modelling, molecular simulation, and multimodal perception. The result is a fast-moving line of work on steerable and equivariant attention, where higher-degree spherical harmonics, Wigner-D representations, and Clebsch-Gordan tensor products are folded directly into the transformer block.

We are interested in two practical questions. First, how to design equivariant transformers that retain provable symmetry guarantees while matching the throughput and expressivity of plain ViTs. Second, how to use these architectures as foundation backbones for downstream geometric tasks (3D perception, robot manipulation, molecular property prediction) so that equivariance is inherited rather than re-engineered for each task.

Selected references

EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations (Liao et al., ICLR 2024) — https://arxiv.org/abs/2306.12059
Steerable Transformers for Volumetric Data (Kundu & Kondor, 2024) — https://arxiv.org/abs/2405.15932
Equivariant Spherical Transformer for Efficient Molecular Modeling (An et al., 2025) — https://arxiv.org/abs/2505.23086
A Materials Foundation Model via Hybrid Invariant-Equivariant Architectures (Yan et al., 2025) — https://arxiv.org/abs/2503.05771

Interest

Geometric Representation Learning

Abstract

Research Idea

Invariance, Equivariance, and Symmetry as Inductive Bias

Spherical, SO(3), and Harmonic Representations for 3D

Self-Supervised Consistency and Generalisation

Equivariant Transformers and Steerable Foundation Models

Jongmin Lee

Assistant Professor of Computer Science Engineering