TL;DR
We present Distribution Normalization, a test-time plug-in module that enhances CLIP’s performance on a wide range of zero-shot tasks with less than 5 lines of additional code.
Abstract
Advances in the field of vision-language contrastive learning have made it possible for many downstream applications to be carried out efficiently and accurately by simply taking the dot product between image and text representations.
One of the most representative approaches proposed recently, CLIP, has garnered widespread adoption due to its effectiveness. CLIP is trained with an InfoNCE loss that takes into account both positive and negative samples to help learn a much more robust representation space. This paper reveals that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, resulting in a loss of information at test time. Intuitively, since the model has been optimized with the InfoNCE loss, test-time procedures should be aligned with it as well. The question is how to retrieve some semblance of the negative-sample information during inference in a computationally efficient way.
To this end, we propose Distribution Normalization (DN), in which we approximate the mean representation of a batch of test samples and use this mean to play the role of the negative samples in the InfoNCE loss. DN requires no retraining or fine-tuning and can be applied effortlessly during inference. Extensive experiments on a wide variety of downstream tasks exhibit a clear advantage of DN over the plain dot product, on top of other existing test-time augmentation methods.
Motivation
The InfoNCE loss considers negative samples from the data distribution for each positive sample and has been shown to be very effective for cross-modal representation learning. In contrast, at test time, similarity is typically measured with the dot product between image and text representations, which uses no information about the data distribution.
Methodology
Our paper aims to rectify this misalignment, and we show that doing so boosts performance consistently across a variety of downstream tasks. However, naively applying the InfoNCE loss to downstream tasks requires iterating over all negative samples for every test sample, which is intractable. Instead, we find that a first-order approximation of the InfoNCE loss, which amounts to simply subtracting the distribution mean in the representation space before taking the dot product, achieves a similar effect. We call this proposed approach Distribution Normalization (DN). DN is very easy to implement and requires no retraining, fine-tuning, or labeled data.
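To make this concrete, below is a minimal PyTorch-style sketch of how the distribution means can be estimated from a small batch of unlabeled test samples (the experiments below use 100 random unlabeled validation samples). The tensors here are random stand-ins for pre-computed, L2-normalized CLIP embeddings, and all variable names are illustrative rather than part of any released code; the exact scoring rule is derived next.

```python
import torch

# Stand-ins for L2-normalized CLIP embeddings of a small *unlabeled* batch of
# test samples. In practice these would come from the CLIP encoders
# (e.g., model.encode_image / model.encode_text), not from random tensors.
image_feats = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
text_feats = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)

# Distribution Normalization only needs the first moment (mean) of each modality.
mu_image = image_feats.mean(dim=0, keepdim=True)  # (1, d)
mu_text = text_feats.mean(dim=0, keepdim=True)    # (1, d)
```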
While a more thorough proof is provided in our paper, we briefly touch on the core idea here. In our analysis, we let $\mathcal{D}_S$ denote the training distribution and $\mathcal{D}_T$ the test distribution, $x_0, y_0$ an image-text pair, $\phi$ and $\psi$ the image and text encoders, respectively, and $\tau$ the temperature hyperparameter used in the InfoNCE loss during training. We then observe that a zeroth-order approximation of the InfoNCE loss can be expressed as
\[\mathcal{L}_{NCE}^{(0)}(\mathcal{D}_T) = 2n \cdot \mathbb{E}_{x_0, y_0 \sim \mathcal{D}_T} \left[e^{-\phi(x_0)^\intercal \psi(y_0)/\tau} \right],\]so that minimizing this loss amounts to maximizing the dot product, and the downstream similarity score is computed as
\[S_{(0)}(x_0, y_0) = \phi(x_0)^\intercal \psi(y_0).\]However, if we can estimate the means $\mu_x$ and $\mu_y$ of $\phi(x_1)$ and $\psi(y_1)$ over the unlabeled test distribution, where $x_1$ and $y_1$ denote negative image and text samples, we can form a first-order approximation of the distribution with $\widehat{P}(\phi(x_1)) = \mathbb{I}\{\phi(x_1) = \mu_x\}$ and $\widehat{P}(\psi(y_1)) = \mathbb{I}\{\psi(y_1) = \mu_y\}$, so that $\widehat{P}(\phi(x_1))$ and $\widehat{P}(\psi(y_1))$ match the true distribution in terms of the first moment. We are thus left with
\[\mathcal{L}_{NCE}^{(1)}(\mathcal{D}_T) = 2n \cdot \mathbb{E}_{x_0, y_0 \sim \mathcal{D}_T} \left[e^{\phi(x_0)^\intercal (\mu_y - \psi(y_0))/\tau} + e^{(\mu_x - \phi(x_0))^\intercal \psi(y_0)/\tau} \right].\]The two exponential terms can be nicely combined through their geometric mean, $\sqrt{e^{a} e^{b}} = e^{(a+b)/2}$, whose exponent equals $-\left(\phi(x_0) - \frac{1}{2}\mu_x\right)^\intercal \left(\psi(y_0) - \frac{1}{2}\mu_y\right)/\tau$ up to an additive constant that does not depend on the pair. Using the monotonicity of exponentiation, we thus obtain the more accurate similarity score
\[S_{(1)}(x_0, y_0) = \left(\phi(x_0) - \frac{1}{2}\mu_x\right)^\intercal \left(\psi(y_0) - \frac{1}{2}\mu_y\right).\]This translates directly to our implementation of DN, in which we subtract off the mean of a test-time batch of image and text embeddings (scaled as in $S_{(1)}$) from each image and text embedding of the test set before taking the dot product.
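As an illustration, here is a minimal PyTorch-style sketch of $S_{(1)}$ applied to zero-shot classification. The function and variable names are our own and the tensors are random stand-ins for L2-normalized CLIP features; this is a sketch of the idea rather than a definitive implementation.

```python
import torch


def dn_score(image_emb: torch.Tensor, text_emb: torch.Tensor,
             mu_image: torch.Tensor, mu_text: torch.Tensor) -> torch.Tensor:
    """First-order DN similarity S_(1) = (phi(x) - mu_x/2)^T (psi(y) - mu_y/2)."""
    return (image_emb - 0.5 * mu_image) @ (text_emb - 0.5 * mu_text).T


# Random stand-ins (shapes only): a batch of test-image embeddings, a set of
# class-prompt embeddings (e.g., "a photo of a {class}"), and the modality means
# estimated from an unlabeled batch as in the previous snippet.
image_emb = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)   # (B, d)
text_emb = torch.nn.functional.normalize(torch.randn(10, 512), dim=-1)   # (C, d)
mu_image = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1).mean(0, keepdim=True)
mu_text = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1).mean(0, keepdim=True)

logits = dn_score(image_emb, text_emb, mu_image, mu_text)  # (B, C) similarity matrix
predictions = logits.argmax(dim=-1)                        # predicted class per image
```

Since DN only shifts each embedding by a fixed vector before the dot product, it adds negligible overhead on top of standard dot-product scoring.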
Results
We present a representative subset of our results. For the full list, please see our paper!
Cross-Modal Retrieval
We first present zero-shot cross-modal retrieval results on MSCOCO; results for additional CLIP variants can be found in the paper. As the bolded entries show, adding DN improves retrieval accuracy across the board for both CLIP and CLIP + TTA. Means for DN* are estimated using 100 random unlabeled validation samples, and average recalls are computed over 5 random seeds.
| MSCOCO (5K test set) | Image $\rightarrow$ Text | | | Text $\rightarrow$ Image | | |
| --- | --- | --- | --- | --- | --- | --- |
| | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| CLIP | $52.4$ | $76.0$ | $84.5$ | $30.2$ | $55.1$ | $66.4$ |
| CLIP + DN* | $52.9$ | $76.4$ | $84.9$ | $32.1$ | $57.4$ | $68.3$ |
| CLIP + TTA + DN* | $\textbf{54.7}$ | $\textbf{77.8}$ | $\textbf{85.6}$ | $\textbf{33.8}$ | $\textbf{59.4}$ | $\textbf{70.1}$ |
Zero-shot Classification
Next, we present zero-shot classification performance on ImageNet1K, CIFAR-100, and SUN397. Means for DN* are estimated using 100 random unlabeled validation samples. Average accuracies and standard deviations are computed over 5 random seeds.
| | ImageNet1K | | CIFAR-100 | | SUN397 | |
| --- | --- | --- | --- | --- | --- | --- |
| | Acc@1 | Acc@5 | Acc@1 | Acc@5 | Acc@1 | Acc@5 |
| CLIP | $61.0$ | $87.4$ | $63.9$ | $88.7$ | $56.1$ | $89.4$ |
| CLIP + DN* | $61.7$ | $87.8$ | $65.1$ | $89.4$ | $57.3$ | $90.2$ |
| CALIP | $61.2$ | $87.5$ | $64.2$ | $88.9$ | $56.1$ | $89.3$ |
| TPT (Inefficient) | $\textbf{63.5}$ | $87.1$ | $65.2$ | $88.1$ | $\textbf{59.4}$ | $88.8$ |
| CLIP + TTA | $62.4$ | $88.5$ | $66.0$ | $90.5$ | $56.9$ | $90.0$ |
| CLIP + TTA + DN* | $63.2$ | $\textbf{88.9}$ | $\textbf{67.1}$ | $\textbf{90.7}$ | $58.1$ | $\textbf{90.7}$ |
Image Captioning Metrics
On image captioning, adding DN improves upon existing baselines on the Flickr8k-Expert, Flickr8k-CF, and THumb datasets.
| | | Flickr8k-Expert | Flickr8k-CF | THumb |
| --- | --- | --- | --- | --- |
| | | $\tau_c$ | $\tau_b$ | $\tau_c$ |
| Ref-free | CLIP-ref | 51.4 | 34.3 | 19.9 |
| | CLIP + TTA | 51.9 | 34.7 | 20.7 |
| | CLIP + TTA + DN | 53.6 | $\textbf{35.7}$ | 23.7 |
| | CLIP + TTA + DN* | 53.5 | 35.5 | 22.7 |
| | CLIP + DN | $\textbf{54.3}$ | 35.4 | $\textbf{23.5}$ |
| | CLIP + DN* | 53.2 | 35.1 | 22.2 |
| Ref-based | BLEU-1 | 32.3 | 17.9 | 11.1 |
| | BLEU-4 | 30.8 | 16.9 | 6.9 |
| | CIDEr | 43.9 | 24.6 | 13.8 |
| | ViLBERTScore-F | 50.1 | - | - |
| | CLIP-ref | 53.0 | 36.4 | 24.7 |
| | CLIP + DN-ref | $\textbf{55.3}$ | $\textbf{37.0}$ | 25.8 |
| | CLIP + DN*-ref | 54.3 | 36.9 | $\textbf{26.2}$ |
Authors
* indicates equal contribution.
Citation
@misc{zhou2023testtime,
title={Test-Time Distribution Normalization for Contrastively Learned Vision-language Models},
author={Yifei Zhou and Juntao Ren and Fengyu Li and Ramin Zabih and Ser-Nam Lim},
year={2023},
eprint={2302.11084},
archivePrefix={arXiv},
primaryClass={cs.LG}
}