Figure 1: Stepwise learning in self-supervised systems. During training of common SSL algorithms, the loss descends in a series of discrete steps, and the dimensionality of the learned embeddings expands in tandem.
Much of deep learning’s success has been attributed to its ability to uncover useful representations of complex data. Self-supervised learning (SSL) has emerged as a leading method for learning these representations directly from unlabeled image data, much as LLMs learn representations of language directly from web-scraped text. Yet despite SSL’s key role in state-of-the-art models such as CLIP and MidJourney, basic questions about how self-supervised image systems learn remain unanswered.
Our recent paper, to be presented at ICML 2023, offers a mathematical picture of the training process of large-scale SSL methods. In a simplified theoretical model, which we solve exactly, we find that learning proceeds in a series of discrete, well-separated steps. Beyond suggesting ways to improve SSL methods, this framework raises new scientific questions whose answers would shed light on some of today’s most important deep learning systems.
Background
We focus on joint-embedding SSL methods, a class that includes contrastive methods, which learn representations obeying view-invariance criteria. The loss functions of these models include a term that enforces matching embeddings for semantically equivalent “views” of an image. Remarkably, this simple approach produces powerful representations for image tasks even when the views are as simple as random crops and color perturbations.
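As a concrete example, a Barlow Twins-style loss can be sketched in a few lines of numpy. This is a minimal illustration rather than a reference implementation; the standardization scheme and the coefficient `lam` are simplified choices of ours:

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=0.005):
    """Cross-correlate the standardized embeddings of two views of a batch
    and push that matrix toward the identity. (Simplified sketch.)"""
    n = z1.shape[0]
    z1 = (z1 - z1.mean(axis=0)) / z1.std(axis=0)  # standardize each feature
    z2 = (z2 - z2.mean(axis=0)) / z2.std(axis=0)
    c = z1.T @ z2 / n                             # d x d cross-correlation
    invariance = np.sum((np.diag(c) - 1) ** 2)    # matching views must agree
    redundancy = np.sum((c - np.diag(np.diag(c))) ** 2)  # features decorrelate
    return invariance + lam * redundancy
```

The diagonal term rewards embeddings that are invariant across views, while the off-diagonal term keeps the features from collapsing onto one another.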
Theory: Revealing Stepwise Learning in SSL with Linearized Models
We first describe an exactly solvable linear model of SSL for which both the training trajectories and the final embeddings can be written in closed form. Even in this simple setting, representation learning unfolds in a series of discrete steps: the rank of the embeddings starts small and grows one dimension at a time.
The main theoretical contribution of our paper is an exact solution of the training dynamics of the Barlow Twins loss under gradient flow for the special case of a linear model \(\mathbf{f}(\mathbf{x}) = \mathbf{W} \mathbf{x}\). In summary, we find that with small initialization, the model learns representations composed precisely of the top-\(d\) eigendirections of the featurewise cross-correlation matrix \(\boldsymbol{\Gamma} \equiv \mathbb{E}_{\mathbf{x},\mathbf{x}'} [ \mathbf{x} \mathbf{x}'^T ]\), where \(\mathbf{x}\) and \(\mathbf{x}'\) are two augmented views of the same image. What is more, these eigendirections are learned one at a time, in a sequence of discrete learning steps ordered by their eigenvalues. Figure 2 illustrates this process: at each step, a new direction appears in the represented function and the loss drops accordingly. Our picture of stepwise learning is a concrete instance of the broader phenomenon of spectral bias, the tendency of learning systems to preferentially learn eigendirections with higher eigenvalues.
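These dynamics are easy to reproduce numerically. The sketch below runs gradient descent on the linear-model loss \(\lVert \mathbf{W}\boldsymbol{\Gamma}\mathbf{W}^T - \mathbf{I} \rVert_F^2\) from small initialization and records how many eigenvalues of the embedding cross-correlation have grown to order one. The eigenvalues of \(\boldsymbol{\Gamma}\), the dimensions, and the learning rate are arbitrary illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_emb = 10, 3

# A synthetic cross-correlation matrix Gamma with well-separated eigenvalues
gamma_eigs = np.array([1.0, 0.5, 0.25, 0.1] + [0.01] * 6)
Q = np.linalg.qr(rng.normal(size=(d_in, d_in)))[0]
Gamma = Q @ np.diag(gamma_eigs) @ Q.T

W = 1e-4 * rng.normal(size=(d_emb, d_in))  # small initialization
lr = 0.05
ranks = []                                 # effective embedding rank over time
for step in range(3000):
    C = W @ Gamma @ W.T                    # embedding cross-correlation
    W -= lr * 4 * (C - np.eye(d_emb)) @ W @ Gamma  # gradient of ||C - I||_F^2
    if step % 10 == 0:
        # count eigenvalues of C that have grown to order one
        ranks.append(int(np.sum(np.linalg.eigvalsh(C) > 0.5)))
```

Run as written, the recorded rank climbs from 0 to 3 in well-separated jumps, each jump arriving later for smaller eigenvalues of \(\boldsymbol{\Gamma}\), and at convergence \(\mathbf{W}\boldsymbol{\Gamma}\mathbf{W}^T = \mathbf{I}\), i.e., the embeddings span the top-\(d\) eigendirections.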
Studying a linear model is more relevant than it might seem: as work on the neural tangent kernel (NTK) has shown, sufficiently wide neural networks also exhibit linear parameterwise dynamics. This fact lets us extend our linear-model solution to wide neural networks, where the model instead learns the top \(d\) eigendirections of an operator associated with the NTK. Since NTK studies have yielded real insight into the training and generalization of nonlinear neural networks, there is reason to hope that some of our findings transfer to practical settings.
Experiment: Observing Stepwise Learning in SSL with ResNets
In our main experiments, we train several leading SSL methods with full-scale ResNet-50 encoders and find that, remarkably, stepwise learning is clearly visible even in realistic settings, suggesting that it is central to the learning behavior of SSL.
To see stepwise learning with ResNets in realistic configurations, we track the eigenvalues of the embedding covariance matrix over the course of training. Small modifications, namely training from smaller-than-normal initialization with a reduced learning rate, make the stepwise behavior easier to see; we adopt them in our experiments with Barlow Twins, SimCLR, and VICReg on the STL-10 dataset. All three methods show clear stepwise learning: the loss descends in a staircase curve, and a new eigenvalue of the embedding covariance grows from zero at each step. An animation of the early steps of Barlow Twins makes this behavior especially vivid.
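The tracked quantity is cheap to compute from a batch of embeddings. A minimal sketch (the function name and details are our own choices, not code from the paper):

```python
import numpy as np

def embedding_spectrum(z):
    """Eigenvalues of the covariance of a batch of embeddings z (n x d),
    sorted in descending order. The number of large eigenvalues is the
    effective dimensionality of the embeddings."""
    z = z - z.mean(axis=0)            # center each embedding feature
    cov = z.T @ z / (len(z) - 1)      # d x d sample covariance matrix
    return np.sort(np.linalg.eigvalsh(cov))[::-1]
```

Logged every few steps during training, this spectrum shows one eigenvalue after another rising from near zero, mirroring the steps in the loss curve.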
Though these methods appear quite different on the surface, it has long been suspected that they do something similar under the hood. Our finding of analogous stepwise learning across all three methods supports a unifying picture: SSL methods learn embeddings one dimension at a time, appending new dimensions in order of salience.
Significance and Future Prospects
Our work gives a basic theoretical picture of how SSL methods assemble their learned representations over the course of training. With this picture in hand, we see both practical routes to making SSL more efficient and scientific questions about self-supervised learning and representation learning more broadly.
On the practical side, SSL models are famously slow to train compared to supervised models, and the reason has been unclear. Our picture of training suggests an explanation: the long training time is spent waiting for the later embedding eigendirections to grow. If that is right, speeding up SSL training might be as simple as selectively accelerating the growth of the small embedding eigendirections, which could be done with minor modifications to the loss function or optimizer. We discuss these possibilities further in the paper.
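In the linear toy model, one hypothetical version of this idea is to right-multiply the gradient by \(\boldsymbol{\Gamma}^{-1}\), which equalizes the growth rates of all eigendirections so that the slow later directions no longer bottleneck training. This is purely an illustrative experiment of ours, not a method proposed in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_emb = 10, 3

# Synthetic cross-correlation matrix with well-separated eigenvalues
gamma_eigs = np.array([1.0, 0.5, 0.25, 0.1] + [0.01] * 6)
Q = np.linalg.qr(rng.normal(size=(d_in, d_in)))[0]
Gamma = Q @ np.diag(gamma_eigs) @ Q.T

def steps_to_converge(precondition, max_steps=2000, lr=0.02):
    """Train the linear model; return the step at which the embedding
    cross-correlation reaches the identity (all modes learned)."""
    W = 1e-4 * rng.normal(size=(d_emb, d_in))  # small initialization
    for step in range(max_steps):
        C = W @ Gamma @ W.T
        if np.allclose(C, np.eye(d_emb), atol=1e-3):
            return step
        grad = 4 * (C - np.eye(d_emb)) @ W @ Gamma
        if precondition:
            grad = grad @ np.linalg.inv(Gamma)  # equalize per-mode growth rates
        W -= lr * grad
    return max_steps

baseline = steps_to_converge(precondition=False)
preconditioned = steps_to_converge(precondition=True)
```

In this toy setting the preconditioned run converges substantially faster, because every eigendirection grows at the top mode’s rate. Whether an analogous trick helps full-scale SSL is exactly the kind of question this framework leaves open.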
On the scientific side, viewing SSL as an iterative process invites questions about the individual eigenmodes: whether some learned eigendirections are more useful than others, how different augmentations change the learned modes, and whether semantic content can be assigned to particular modes. If different flavors of representation learning do converge to similar embeddings, a hypothesis that can be tested, then answers to these questions may have implications reaching beyond deep learning.
In all, we hope these findings serve as a stepping stone for future studies of the learning dynamics of deep networks.
This blog post is based on the paper “On the Stepwise Nature of Self-Supervised Learning”, a collaborative effort with Maksis Knutins, Liu Ziyin, Daniel Geisz, and Joshua Albrecht. The research was conducted at Generally Intelligent, where Jamie Simon serves as a Research Fellow. This post is also available here. We welcome your questions and comments.