Feature suppression poses an open challenge to contrastive learning research
Example images from the DigitOnImageNet dataset (top) and the Modified MultiDigits dataset (bottom).
In contrastive learning, the presence of easy-to-learn "color distribution" features can suppress the learning of the competing "object class" features. This is typically addressed with color augmentation; however, there may be scenarios where known augmentations cannot fully remove the feature suppression effect. We quantitatively study the feature suppression phenomenon by constructing datasets with explicit and controllable competing features, and measure how well contrastive learning methods continue to perform.
We create three datasets:
DigitOnImageNet, where MNIST digits are overlaid on ImageNet images via channel addition (a construction sketch follows this list),
Modified MultiDigits, where only two digits are used and one of them varies in size, and
RandBit, where a real image is concatenated along the channel dimension with extra channels that encode a random integer.
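For concreteness, here is a minimal sketch of a DigitOnImageNet-style channel-addition overlay. The function name, random placement policy, and clipping are illustrative assumptions rather than the exact construction used in our pipeline:

```python
import numpy as np

def overlay_digit_on_image(image, digit, rng=np.random):
    """Overlay a grayscale MNIST digit onto an RGB image by channel-wise addition.

    image : H x W x 3 uint8 array (e.g. a resized ImageNet image)
    digit : h x w uint8 array (an MNIST digit), with h <= H and w <= W
    The digit is pasted at a (here: random) location, added to every channel,
    and the result is clipped back to the valid pixel range.
    """
    out = image.astype(np.int32)
    h, w = digit.shape
    H, W, _ = out.shape
    y = rng.randint(0, H - h + 1)  # assumed placement policy: uniform at random
    x = rng.randint(0, W - w + 1)
    out[y:y + h, x:x + w, :] += digit.astype(np.int32)[:, :, None]
    return np.clip(out, 0, 255).astype(np.uint8)

# Stand-in arrays, so the sketch runs without real MNIST/ImageNet data:
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
digit = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)
combined = overlay_digit_on_image(image, digit)
```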
On the DigitOnImageNet dataset, we discover a performance trade-off between digit recognition and object recognition. This shows that simple features suppress the learning of difficult features when both are shared between the two augmented views. It is therefore difficult to learn both competing features with existing contrastive losses.
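This trade-off is quantified with linear evaluation (as in the RandBit results below): two separate linear probes are trained on the same frozen representations, one for digit labels and one for object labels. A hedged sketch of that measurement, with illustrative helper names and scikit-learn standing in for whatever probe implementation is actually used:

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen representations; return test accuracy."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_feats, train_labels)
    return probe.score(test_feats, test_labels)

# Two probes on the same frozen embeddings, one per competing feature:
# digit_acc  = linear_probe_accuracy(z_train, digit_labels_train, z_test, digit_labels_test)
# object_acc = linear_probe_accuracy(z_train, object_labels_train, z_test, object_labels_test)
```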
On the Modified MultiDigits dataset, we find that the learned representation of the smaller digit degrades significantly as the size of the other digit increases, whereas the dominant object can still be learned well. Large objects can therefore suppress the learning of features of smaller objects.
On the RandBit dataset, we observe that adding even a few bits of a competing feature quickly destroys linear evaluation accuracy. The model rapidly learns the added bits but then saturates in performance; adding these bits thus compromises the learning of more generalizable features.
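To make "a few bits" concrete, the sketch below shows one plausible way a RandBit-style example can be formed: the random integer is encoded one bit per extra channel and concatenated to the image. The function name and exact encoding are assumptions; presumably the same integer is shared by both augmented views of an image so that it acts as a shared competing feature:

```python
import numpy as np

def append_random_bits(image, num_bits, rng=np.random):
    """Concatenate extra channels encoding a random integer to an image.

    image : H x W x C float array
    Each extra channel holds one bit of the sampled integer, broadcast over the
    spatial dimensions, so the competing feature is trivially easy to extract.
    """
    H, W, _ = image.shape
    value = rng.randint(0, 2 ** num_bits)  # the competing "random integer" feature
    bits = np.array([(value >> i) & 1 for i in range(num_bits)], dtype=image.dtype)
    bit_channels = np.broadcast_to(bits, (H, W, num_bits))
    return np.concatenate([image, bit_channels], axis=-1)

# Example: append 4 bits to a dummy image, giving a (32, 32, 7) input.
image = np.random.rand(32, 32, 3).astype(np.float32)
augmented = append_random_bits(image, num_bits=4)
```

Varying num_bits controls how many bits of the competing feature are added, matching the explicit and controllable competing-feature design described above.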