This repository explores the semantic richness of patch-level representations from Vision Transformer (ViT) models, including CLIP, DINO, MAE, and DINOv2, through linear probing on the Pascal VOC 2012 semantic segmentation dataset.
We aim to answer the question: "How semantically meaningful are the patch embeddings of pretrained ViTs?"
Given a ViT model and the Pascal VOC segmentation dataset:
- Patch Labeling: Each patch is assigned a class via majority voting over its pixels (each pixel has a class label).
- Linear Probing: A linear classifier is trained to predict patch-level classes using frozen patch embeddings from the ViT model.
- Evaluation: We measure the trained probe's patch-level accuracy on the Pascal VOC 2012 validation set.
The linear layer has shape (EMBED_DIM, NUM_CLASSES) and is trained using frozen features from the ViT backbone.
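As a rough illustration, here is a minimal sketch of the two steps, assuming a frozen backbone that already yields `patch_embeddings` of shape `(N, EMBED_DIM)` and a VOC mask `pixel_labels` of shape `(H, W)`. The names, the constants, and the omission of VOC's 255 ignore label are simplifications for readability, not the exact code in `main.py`:

```python
# Minimal sketch of the probing pipeline (not the exact implementation in main.py).
import torch
import torch.nn as nn

PATCH_SIZE, EMBED_DIM, NUM_CLASSES = 32, 768, 21  # hypothetical values (21 = 20 VOC classes + background)

def patch_labels_from_mask(pixel_labels: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Assign each patch the majority class of its pixels (majority voting).
    Assumes H and W are divisible by patch_size and ignores VOC's 255 ignore label."""
    patches = pixel_labels.unfold(0, patch_size, patch_size).unfold(1, patch_size, patch_size)
    patches = patches.reshape(-1, patch_size * patch_size)   # (num_patches, patch_size**2)
    return patches.mode(dim=1).values                        # majority class per patch

# Linear probe: a single (EMBED_DIM -> NUM_CLASSES) layer on frozen patch embeddings.
probe = nn.Linear(EMBED_DIM, NUM_CLASSES)
optimizer = torch.optim.Adam(probe.parameters(), lr=5e-3)
criterion = nn.CrossEntropyLoss()

def train_step(patch_embeddings: torch.Tensor, patch_labels: torch.Tensor) -> float:
    """One optimisation step; embeddings come from the frozen backbone, so only the probe is updated."""
    logits = probe(patch_embeddings)                          # (N, NUM_CLASSES)
    loss = criterion(logits, patch_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```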
conda create --name probe_it python=3.9 -y
conda activate probe_it
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 \
--extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
This repo uses the Pascal VOC 2012 dataset.
- Download it from this link.
- Follow the MMSegmentation preparation guide to set it up properly.
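As a quick sanity check that the data is readable, you can load it with torchvision's built-in VOC loader. This is only a hedged convenience check, not the MMSegmentation directory layout the repo expects, and the `root` path is a placeholder:

```python
# Optional sanity check: load VOC 2012 with torchvision (root must contain VOCdevkit/VOC2012).
from torchvision.datasets import VOCSegmentation

voc = VOCSegmentation(root="data", year="2012", image_set="val", download=False)
image, mask = voc[0]  # PIL images: the RGB input and its class-index segmentation mask
print(len(voc), image.size, mask.size)
```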
To run a linear probing experiment:
python main.py --config path/to/config.yaml --out_dir results/
Make sure to define your backbone and training parameters in the YAML config.
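For context, here is a hedged sketch of how frozen patch embeddings can be pulled from one of the listed backbones (DINOv2 ViT-S/14 via torch.hub). The actual backbone is built from the YAML config; the hub entry point, image size, and the `forward_features` output key used below are assumptions based on the upstream DINOv2 repository:

```python
# Illustration only: extracting frozen patch tokens from a DINOv2 backbone.
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)             # dummy, already-normalised image batch
    feats = backbone.forward_features(x)
    patch_tokens = feats["x_norm_patchtokens"]  # (1, num_patches, EMBED_DIM)

print(patch_tokens.shape)
```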
Below are patch-level accuracies on the Pascal VOC 2012 validation set for linear probes trained on various ViT backbones.
Training used: batch size = 16, Adam optimizer, learning rate = 5e-3, 3 epochs, and 32×32 patch resolution.
| Visual Backbone            | ViT-S | ViT-B | ViT-L |
|----------------------------|-------|-------|-------|
| MAE                        | -     | 0.61  | 0.62  |
| CLIP                       | -     | 0.91  | 0.90  |
| DINO                       | 0.65  | 0.77  | -     |
| DINOv2 (without registers) | 0.95  | 0.96  | 0.95  |
| DINOv2 (with registers)    | 0.97  | 0.97  | 0.96  |
- Add support for other segmentation datasets (e.g., ADE20K, Cityscapes)
Feel free to open issues or submit pull requests for improvements, new backbones, or dataset integrations.
If this repo helped you, or you used it in your work, drop a star ⭐!