Computer Vision Course Resource

Self-Supervised Learning for Object Detection

A single-page visual guide for understanding how modern self-supervised vision backbones learn representations before they are transferred to detection tasks.

Unlabeled images → SSL backbone → Detector head → Boxes + classes
No labels first: SSL learns image features from pretext signals such as views, masks, teachers, or target embeddings.
Backbone matters: The pretrained encoder becomes a feature extractor for classification, segmentation, or object detection.
Detection still needs boxes: After SSL pretraining, a detector is fine-tuned with labeled bounding boxes and class labels.
Teacher-student

DINOv2

DINOv2 trains robust visual features without supervision by scaling self-distillation, curated data, and ViT backbones, then distilling strong teachers into smaller models.

Learns by

Matching student features to a teacher representation from augmented image views.

Detection transfer

Use the encoder as a pretrained backbone, then fine-tune a detection head on labeled boxes.

PyTorch focus

Teacher network, student network, centering/sharpening, and feature extraction.

Start with many class-relevant images. They do not need bounding boxes for SSL pretraining.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vit_b_16

class ProjectionHead(nn.Module):
    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        # dim=768 is the ViT-B/16 feature width.
        # out_dim=256 is the compact embedding used by the SSL loss.
        # These projection layers are trainable during SSL pretraining.
        self.net = nn.Sequential(
            nn.Linear(dim, 2048),
            nn.GELU(),
            nn.Linear(2048, out_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class SSLBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # weights=None means the encoder starts from random weights.
        # Replace this with pretrained weights when fine-tuning a detector.
        self.encoder = vit_b_16(weights=None)
        # Remove the classifier because SSL needs representation vectors,
        # not ImageNet class logits.
        self.encoder.heads = nn.Identity()
        self.projector = ProjectionHead()

    def forward(self, images):
        features = self.encoder(images)
        return self.projector(features)

student = SSLBackbone()
teacher = SSLBackbone()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    # Teacher parameters are not trained by backpropagation.
    # They are updated by exponential moving average of the student.
    p.requires_grad = False

ssl_hyperparams = {
    "lr": 1e-4,          # learning rate for trainable student parameters
    "weight_decay": 0.04, # regularizes large weights in ViT/projector
    "ema": 0.996,        # teacher update momentum; higher is smoother
    "temperature": 0.1   # sharpens similarity distributions
}
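
The listing above sets up the teacher, the student, and the hyperparameters, but stops before the EMA update and the centering/sharpening step named under "PyTorch focus". The sketch below shows one way a single training step could look. It is a simplified DINO-style step, not the DINOv2 training recipe itself. The function names, the teacher temperature of 0.04, the center momentum of 0.9, and the tiny linear networks standing in for the ViT are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996) -> None:
    """Move each teacher parameter a small step toward the student parameter."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)

def self_distillation_loss(student_out, teacher_out, center,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student distributions."""
    # Centering: subtract a running mean of teacher outputs.
    # Sharpening: divide by a low teacher temperature before the softmax.
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    student_log_probs = F.log_softmax(student_out / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

@torch.no_grad()
def update_center(center, teacher_out, momentum=0.9):
    """Running mean of teacher outputs, used by the centering step."""
    return center * momentum + teacher_out.mean(dim=0, keepdim=True) * (1.0 - momentum)

# Tiny stand-in networks so the sketch runs quickly without a ViT (assumption).
torch.manual_seed(0)
toy_student = nn.Linear(8, 16)
toy_teacher = nn.Linear(8, 16)
toy_teacher.load_state_dict(toy_student.state_dict())
for p in toy_teacher.parameters():
    p.requires_grad = False  # the teacher is never updated by backpropagation

optimizer = torch.optim.AdamW(toy_student.parameters(), lr=1e-4, weight_decay=0.04)
center = torch.zeros(1, 16)

# Stand-ins for two augmented views of the same batch of images.
view_a = torch.randn(4, 8)
view_b = view_a + 0.1 * torch.randn(4, 8)

with torch.no_grad():
    teacher_out = toy_teacher(view_a)   # teacher sees one view
student_out = toy_student(view_b)       # student sees the other view

loss = self_distillation_loss(student_out, teacher_out, center)
optimizer.zero_grad()
loss.backward()                         # gradients flow only into the student
optimizer.step()
ema_update(toy_teacher, toy_student, momentum=0.996)
center = update_center(center, teacher_out)
```

Centering and sharpening pull in opposite directions: centering discourages one output dimension from dominating, while the low teacher temperature discourages a uniform output. Using both is what keeps this kind of self-distillation from collapsing to a trivial solution.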
1. Replace or initialize

Backbone

The SSL encoder replaces the normal YOLO feature extractor, or initializes matching layers when the architectures are compatible.

2. Adapt feature maps

Neck

YOLO expects multi-scale feature maps. A small adapter or FPN/PAN neck converts SSL features into detection-friendly scales.

3. Predict objects

Head

The YOLO head remains supervised. It learns box coordinates, object confidence, and class probabilities from labeled annotations.

Input image → SSL backbone features → YOLO neck → YOLO detection head → Boxes, classes, confidence
In practice, YOLO backbones are usually CNN/CSP-style modules, while many SSL models are ViT-based. When shapes do not match directly, students should treat the SSL model as a feature extractor and add adapter layers before the YOLO neck. Freeze the SSL backbone first, train the neck/head, then unfreeze carefully with a smaller learning rate.
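
One possible shape for such an adapter is sketched below. It reshapes ViT patch tokens into a 2D grid and derives three feature maps at strides 8, 16, and 32, which is the kind of multi-scale input a YOLO-style neck expects. The class name, the output channel counts, the learning rates, and the linear layer standing in for the SSL encoder are all hypothetical. Note that the torchvision ViT forward pass in the earlier listing returns only the class-token vector, so obtaining patch tokens from it needs an extra step that is not shown here.

```python
import math
import torch
import torch.nn as nn

class TokenToPyramidAdapter(nn.Module):
    """Hypothetical adapter: ViT patch tokens -> three feature maps."""

    def __init__(self, embed_dim=768, out_channels=(128, 256, 512)):
        super().__init__()
        c3, c4, c5 = out_channels
        # Upsample x2 (stride 8), keep resolution (stride 16), downsample x2 (stride 32).
        self.to_p3 = nn.ConvTranspose2d(embed_dim, c3, kernel_size=2, stride=2)
        self.to_p4 = nn.Conv2d(embed_dim, c4, kernel_size=1)
        self.to_p5 = nn.Conv2d(embed_dim, c5, kernel_size=3, stride=2, padding=1)

    def forward(self, patch_tokens):
        batch, num_tokens, dim = patch_tokens.shape
        side = math.isqrt(num_tokens)
        assert side * side == num_tokens, "expects a square grid of patch tokens"
        # (batch, tokens, dim) -> (batch, dim, side, side)
        grid = patch_tokens.transpose(1, 2).reshape(batch, dim, side, side)
        return self.to_p3(grid), self.to_p4(grid), self.to_p5(grid)

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    """Freeze (False) or unfreeze (True) every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

# Stand-in for an SSL encoder that outputs 196 patch tokens of width 768,
# as a ViT-B/16 would for a 224x224 image (assumption for illustration).
toy_backbone = nn.Linear(768, 768)
adapter = TokenToPyramidAdapter()

# Stage 1: backbone frozen, only the adapter (and, in practice, neck/head) trains.
set_requires_grad(toy_backbone, False)
stage1_optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

# Stage 2: unfreeze the backbone with a much smaller learning rate.
set_requires_grad(toy_backbone, True)
stage2_optimizer = torch.optim.AdamW([
    {"params": toy_backbone.parameters(), "lr": 1e-5},
    {"params": adapter.parameters(), "lr": 1e-3},
])

patch_tokens = toy_backbone(torch.randn(2, 196, 768))
p3, p4, p5 = adapter(patch_tokens)
print(p3.shape, p4.shape, p5.shape)
```

Separate parameter groups let the pretrained backbone change slowly while the freshly initialized adapter layers learn at a normal rate, which is the intent behind "unfreeze carefully".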
YOLO12 detection weights are available in five sizes: yolo12n.pt, yolo12s.pt, yolo12m.pt, yolo12l.pt, and yolo12x.pt. Choose nano/small for class demos and limited GPUs, medium/large/xlarge for higher accuracy when memory and time allow.
n (Nano): fastest, lowest memory
s (Small): balanced classroom default
m (Medium): stronger accuracy
l (Large): slower, better features
x (Xlarge): most expensive option

Dataset YAML

path: /content/datasets/road_objects
train: images/train
val: images/val
test: images/test

names:
  0: person
  1: bicycle
  2: car
  3: bus
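
The YAML above declares four classes, so every label row should use a class id from 0 to 3. Ultralytics detection labels are plain-text files with one row per object in the form `class x_center y_center width height`, with coordinates normalized to the range 0 to 1. The helper below is a minimal sketch for checking a single row before training. The function name and the default of four classes are assumptions chosen to match this example dataset.

```python
def validate_label_line(line: str, num_classes: int = 4) -> bool:
    """Return True if a YOLO-format label row looks well formed."""
    parts = line.split()
    if len(parts) != 5:
        return False  # expected: class x_center y_center width height
    try:
        class_id = int(parts[0])
        coords = [float(value) for value in parts[1:]]
    except ValueError:
        return False  # non-numeric field
    if not 0 <= class_id < num_classes:
        return False  # class id not listed under names:
    in_range = all(0.0 <= value <= 1.0 for value in coords)
    has_area = coords[2] > 0 and coords[3] > 0
    return in_range and has_area

print(validate_label_line("2 0.5 0.5 0.2 0.3"))   # valid "car" box
print(validate_label_line("4 0.5 0.5 0.2 0.3"))   # class id 4 is not defined
print(validate_label_line("1 0.5 1.2 0.2 0.3"))   # y_center outside [0, 1]
```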

Install

pip install ultralytics

python -c "import ultralytics; ultralytics.checks()"
from ultralytics import YOLO

model_size = "s"  # choose from: "n", "s", "m", "l", "x"
model = YOLO(f"yolo12{model_size}.pt")

results = model.train(
    data="data.yaml",       # dataset file with train/val paths and class names
    epochs=100,             # full passes over the training set
    imgsz=640,              # input image size; larger can improve small objects
    batch=16,               # lower this if GPU memory is limited
    lr0=0.01,               # initial learning rate used by the YOLO optimizer
    weight_decay=0.0005,    # regularization for trainable weights
    device=0,               # GPU id; use "cpu" if no CUDA GPU is available
    project="cv_course_runs",
    name=f"yolo12{model_size}_objects"
)