Contrastive Learning


Today, let’s learn what “contrastive learning” is.

Link to a nice article: https://www.v7labs.com/blog/contrastive-learning-guide

I’m just going to extract the essence of that article, and add my own explanations in this blog.

Hope you find it useful! Let’s get started!


What is Contrastive Learning?

Contrastive Learning is a Machine Learning paradigm where unlabeled data points are juxtaposed against each other (placed side by side and contrasted) to teach a model which points are similar and which are different.

Samples belonging to the same distribution are pulled towards each other in the embedding space. In contrast, those belonging to different distributions are pushed away from each other.

The Importance of Contrastive Learning

There are many different machine learning paradigms. In general, we can group them into three categories:
1. Supervised Learning (you have ground-truth labels)
2. Unsupervised Learning (you don’t have ground-truth labels)
3. Reinforcement Learning (learning by interacting with the environment; learning from mistakes)

The problem with supervised learning is that it depends heavily on the quality of the labels, which are usually annotated by humans. Annotation is a labor-intensive and cumbersome process.

To reduce annotation costs, researchers have been focusing on methods that do not require as much supervision. One example is Semi-supervised Learning: first, train a supervised model on a small amount of labeled data. Then, use the trained model to generate labels for the rest of the dataset (which has no labels to begin with). We call the generated labels “pseudo ground-truth”. Finally, use the full dataset, including the pseudo labels, to train another model.
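To make the three steps concrete, here is a minimal sketch of pseudo-labeling. Everything in it is an assumption for illustration: the toy 2-D data, the function names `fit_centroids` and `predict`, and the choice of a nearest-centroid classifier as a stand-in for a real model.

```python
from statistics import mean

def fit_centroids(points, labels):
    """Train a tiny nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for cls in set(labels):
        members = [p for p, y in zip(points, labels) if y == cls]
        centroids[cls] = [mean(dim) for dim in zip(*members)]
    return centroids

def predict(centroids, point):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(point, centroid))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))

# Step 1: train a supervised model on the small labeled set (toy data).
labeled_x = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
labeled_y = ["a", "a", "b", "b"]
teacher = fit_centroids(labeled_x, labeled_y)

# Step 2: use the trained model to generate pseudo labels for unlabeled data.
unlabeled_x = [[0.1, 0.2], [0.8, 1.0], [0.3, 0.1], [1.2, 0.8]]
pseudo_y = [predict(teacher, x) for x in unlabeled_x]

# Step 3: train another model on the labeled data plus the pseudo-labeled data.
student = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
```

In practice, pseudo labels can be wrong, so it is common to keep only the predictions the first model is confident about. That filtering step is left out here to keep the sketch short.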

Another ML paradigm is Self-supervised Learning (SSL), which does NOT require any pre-annotated labels. Instead, SSL generates the “labels” from the data itself and uses them to supervise the model.
For example, in natural language processing (NLP), there is a task called “Masked Language Modeling (MLM)”, in which the model predicts missing words (words are randomly masked during training). An even simpler task is “predicting the next word in a sentence”: given an input sentence, the label (ground truth) for the prediction at the current word is simply the next word in that sentence. The same idea applies to end-to-end trajectory planning in autonomous driving: given an input video, the label for the predicted trajectory can be found in the following frames of the same video.

That’s why this method is called “self-supervised”.
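Here is a small sketch of how labels can come from the data itself for the two NLP tasks mentioned above. The helper names `next_word_pairs` and `mask_words`, the `[MASK]` token string, and the 30% masking probability are assumptions chosen for illustration. Real systems work on subword tokens rather than whitespace-split words.

```python
import random

def next_word_pairs(sentence):
    """Next-word prediction: each word is an input, the following word is its label."""
    words = sentence.split()
    return list(zip(words[:-1], words[1:]))

def mask_words(sentence, mask_prob=0.3, mask_token="[MASK]", seed=0):
    """Masked language modeling: hide random words; the hidden words are the labels."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    masked, labels = [], {}
    for i, word in enumerate(sentence.split()):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels[i] = word  # position -> original word the model must recover
        else:
            masked.append(word)
    return masked, labels

sentence = "the otter swims in the river"
pairs = next_word_pairs(sentence)       # (input word, label) pairs
masked, labels = mask_words(sentence)   # model input and its labels

print(pairs[:2])
print(masked, labels)
```

Notice that no human annotated anything: both the inputs and the labels were derived from the raw sentence.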

One of the oldest and most popular techniques employed in SSL is Contrastive Learning. It uses “positive” and “negative” samples to guide the Deep Learning models.

Let us now discuss the working principle of Contrastive Learning.

How does Contrastive Learning work in Vision AI?

Contrastive Learning mimics the way humans learn. For example, we might not know what otters are or what grizzly bears are, but seeing the images (as shown below), we can at least infer which pictures show the same animals.

Source: https://www.v7labs.com/blog/contrastive-learning-guide

The basic contrastive learning framework is as follows:
1. Select a data sample — called “anchor”
2. All data points belonging to the same distribution as the anchor are called “positive” samples.
3. All data points belonging to a different distribution are called the “negative” samples.

The goal of the SSL model is to:
– Minimize the distance between the anchor and positive samples.
– Maximize the distance between the anchor and negative samples.
Here, distances are computed in the feature space (embedding/latent space).
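One common way to turn these two goals into a single training objective is a triplet loss with a margin. The sketch below is only an illustration under that assumption: the function names, the 2-D toy embeddings, and the margin value of 1.0 are all made up for this post, and other contrastive losses exist.

```python
import math

def euclidean(u, v):
    """Distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    d_pos = euclidean(anchor, positive)  # distance we want to minimize
    d_neg = euclidean(anchor, negative)  # distance we want to maximize
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings (assumed values, already in the feature space).
anchor = [0.0, 0.0]
positive = [0.1, 0.0]   # close to the anchor
negative = [3.0, 4.0]   # far from the anchor

good = triplet_loss(anchor, positive, negative)  # well-separated embedding
bad = triplet_loss(anchor, negative, positive)   # roles swapped: poorly separated

print(good, bad)
```

The loss is low when the positive sits near the anchor and the negative sits far away, and it grows when the situation is reversed. Training the model to reduce this loss moves the embeddings in the desired directions.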

Source: https://www.v7labs.com/blog/contrastive-learning-guide

As shown in the example above, two images belonging to the same class lie close to each other in the embedding space (“d+”), and those belonging to different classes lie at a greater distance from each other (“d-”). Thus, a contrastive learning model (denoted by “theta” in the example above) tries to minimize the distance “d+” and maximize the distance “d-”.

There are several techniques to select the positive and negative samples with respect to the anchor. You can refer to the article for more information.

There is a lot more to discuss in CL, but what I’d like to focus on is its application in one of the most important foundation models: CLIP.

To be continued…