Self-Supervised Learning for Object Detection

What Students Should Notice

No labels first: SSL learns image features from pretext signals such as views, masks, teachers, or target embeddings.

Backbone matters: The pretrained encoder becomes a feature extractor for classification, segmentation, or object detection.

Detection still needs boxes: After SSL pretraining, a detector is fine-tuned with labeled bounding boxes and class labels.

Interactive Model Explorer

Teacher-student

DINOv2

DINOv2 trains robust visual features without supervision by scaling self-distillation, curated data, and ViT backbones, then distilling strong teachers into smaller models.

Learns by

Matching student features to a teacher representation from augmented image views.

Detection transfer

Use the encoder as a pretrained backbone, then fine-tune a detection head on labeled boxes.

PyTorch focus

Teacher network, student network, centering/sharpening, and feature extraction.
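
The centering and sharpening steps named above can be sketched as a DINO-style loss. This is a minimal illustration, assuming unnormalized logits from a projection head; the function names `dino_loss` and `update_center`, the two temperatures, and the center momentum are illustrative assumptions rather than values taken from this page.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher and the student."""
    # Centering: subtract a running mean so no single dimension dominates.
    # Sharpening: a low teacher temperature makes the target distribution peaky.
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1)
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    # The teacher is a fixed target, so no gradient flows through it.
    return -(teacher_probs.detach() * student_log_probs).sum(dim=-1).mean()

@torch.no_grad()
def update_center(center, teacher_logits, momentum=0.9):
    """Exponential moving average of the teacher's batch-mean output."""
    batch_mean = teacher_logits.mean(dim=0, keepdim=True)
    return center * momentum + batch_mean * (1.0 - momentum)

torch.manual_seed(0)
student_logits = torch.randn(8, 256, requires_grad=True)  # student, view A
teacher_logits = torch.randn(8, 256)                      # teacher, view B
center = torch.zeros(1, 256)

loss = dino_loss(student_logits, teacher_logits, center)
loss.backward()  # gradients reach only the student logits
center = update_center(center, teacher_logits)
```

Centering alone would push the teacher toward a uniform output and sharpening alone toward a single dominant dimension; using both is what keeps the representation from collapsing.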

Build the Detection Pipeline

Start with many class-relevant images. They do not need bounding boxes for SSL pretraining.

PyTorch Code Walkthrough

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vit_b_16

class ProjectionHead(nn.Module):
    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        # dim=768 is the ViT-B/16 feature width.
        # out_dim=256 is the compact embedding used by the SSL loss.
        # These projection layers are trainable during SSL pretraining.
        self.net = nn.Sequential(
            nn.Linear(dim, 2048),
            nn.GELU(),
            nn.Linear(2048, out_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class SSLBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # weights=None means the encoder starts from random weights.
        # Replace this with pretrained weights when fine-tuning a detector.
        self.encoder = vit_b_16(weights=None)
        # Remove the classifier because SSL needs representation vectors,
        # not ImageNet class logits.
        self.encoder.heads = nn.Identity()
        self.projector = ProjectionHead()

    def forward(self, images):
        features = self.encoder(images)
        return self.projector(features)

student = SSLBackbone()
teacher = SSLBackbone()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    # Teacher parameters are not trained by backpropagation.
    # They are updated by exponential moving average of the student.
    p.requires_grad = False

ssl_hyperparams = {
    "lr": 1e-4,          # learning rate for trainable student parameters
    "weight_decay": 0.04, # regularizes large weights in ViT/projector
    "ema": 0.996,        # teacher update momentum; higher is smoother
    "temperature": 0.1   # sharpens similarity distributions
}
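
The comments above state that the teacher is updated by an exponential moving average of the student, but the update step itself is not shown. A minimal sketch, assuming the `ema` value from `ssl_hyperparams`; the helper name `ema_update` and the tiny `nn.Linear` stand-ins are hypothetical and only keep the example fast.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """teacher = momentum * teacher + (1 - momentum) * student, per parameter."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

# Tiny stand-ins so the sketch runs quickly; swap in the real student/teacher.
student = nn.Linear(4, 2)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False

# Pretend an optimizer step moved every student weight by +1.
with torch.no_grad():
    for p in student.parameters():
        p.add_(1.0)

before = teacher.weight.clone()
ema_update(teacher, student, momentum=0.996)
shift = teacher.weight - before  # each weight moves 0.4% of the way to the student
```

This sketch covers parameters only. A model with buffers, such as BatchNorm running statistics, would need those copied or averaged separately.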

import torch
from PIL import Image
from torchvision import transforms

# SSLBackbone is the class defined in the pretraining block above.
ssl_model = SSLBackbone()
ssl_model.load_state_dict(torch.load("ssl_backbone.pt", map_location="cpu"))
ssl_model.eval()

preprocess = transforms.Compose([
    # The resize must match the resolution used by the SSL backbone.
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # Normalization keeps input scale consistent with common ViT training.
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# "example.jpg" is a placeholder path; point it at any image file.
pil_image = Image.open("example.jpg").convert("RGB")

with torch.no_grad():
    # no_grad disables gradient storage during inference feature extraction.
    image = preprocess(pil_image).unsqueeze(0)
    embedding = ssl_model.encoder(image)

print(embedding.shape)  # [batch, feature_dim]; for ViT-B/16 this is [1, 768]

# Conceptual transfer recipe:
# 1. Load the SSL encoder weights.
# 2. Use the encoder as the detector backbone.
# 3. Add a neck and detection head.
# 4. Fine-tune with labeled boxes.

# Assumes `detector` is a torchvision-style detection model with a `.backbone`
# attribute, and `train_loader` yields (images, targets) batches with boxes.
for param in detector.backbone.parameters():
    # Freeze the SSL backbone first so the detector head learns stable boxes.
    # Trainable parameters at this stage: neck + detection head only.
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, detector.parameters()),
    lr=2e-4,           # head learning rate; lower if loss is unstable
    weight_decay=0.05  # helps prevent overfitting on small box datasets
)

detector.train()  # training mode, so the forward pass returns a dict of losses
for images, targets in train_loader:
    losses = detector(images, targets)
    loss = sum(losses.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Then unfreeze the SSL backbone with a smaller learning rate, for example
# backbone lr=2e-5 and head lr=2e-4, so pretrained features are not destroyed.
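
The second stage described in the comment above can be expressed with optimizer parameter groups. A minimal sketch, assuming the detector exposes a `.backbone` attribute; `ToyDetector` is a hypothetical stand-in so the example runs without a real detection model.

```python
import torch
import torch.nn as nn

class ToyDetector(nn.Module):
    """Hypothetical stand-in with the same backbone/head split as a detector."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 8)
        self.head = nn.Linear(8, 4)

detector = ToyDetector()

# Stage 1 froze the backbone; stage 2 makes it trainable again.
for param in detector.backbone.parameters():
    param.requires_grad = True

backbone_params = list(detector.backbone.parameters())
backbone_ids = {id(p) for p in backbone_params}
head_params = [p for p in detector.parameters() if id(p) not in backbone_ids]

optimizer = torch.optim.AdamW(
    [
        {"params": backbone_params, "lr": 2e-5},  # gentle updates for SSL features
        {"params": head_params, "lr": 2e-4},      # neck + head keep the larger rate
    ],
    weight_decay=0.05,
)
```

Building a new optimizer for the second stage is the simplest option, since the one from stage 1 was created without the frozen backbone parameters.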

How an SSL Backbone Connects to a YOLO Detector

1. Replace or initialize

Backbone

The SSL encoder replaces the normal YOLO feature extractor, or initializes matching layers when the architectures are compatible.

2. Adapt feature maps

Neck

YOLO expects multi-scale feature maps. A small adapter or FPN/PAN neck converts SSL features into detection-friendly scales.

3. Predict objects

Head

The YOLO head remains supervised. It learns box coordinates, object confidence, and class probabilities from labeled annotations.

Input image → SSL backbone features → YOLO neck → YOLO detection head → Boxes, classes, confidence

In practice, YOLO backbones are usually CNN/CSP-style modules, while many SSL models are ViT-based. When shapes do not match directly, students should treat the SSL model as a feature extractor and add adapter layers before the YOLO neck. Freeze the SSL backbone first, train the neck/head, then unfreeze carefully with a smaller learning rate.
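
One way to picture the adapter layers mentioned above is a small module that reshapes ViT patch tokens into a grid and resamples it to three scales. This is a sketch under stated assumptions: it needs the per-patch tokens (not only the pooled `[batch, 768]` vector), a 224x224 input with 16x16 patches, and a neck that accepts 256-channel maps. The class name `ViTToPyramid` and the channel count are hypothetical.

```python
import torch
import torch.nn as nn

class ViTToPyramid(nn.Module):
    """Hypothetical adapter: ViT patch tokens -> three feature-map scales."""

    def __init__(self, dim=768, out_channels=256, grid=14):
        super().__init__()
        self.grid = grid  # 224 / 16 = 14 patches per side
        # Stride 8 relative to the input: upsample 14x14 to 28x28.
        self.up = nn.ConvTranspose2d(dim, out_channels, kernel_size=2, stride=2)
        # Stride 16: keep the native 14x14 token grid.
        self.same = nn.Conv2d(dim, out_channels, kernel_size=1)
        # Stride 32: downsample 14x14 to 7x7.
        self.down = nn.Conv2d(dim, out_channels, kernel_size=3, stride=2, padding=1)

    def forward(self, patch_tokens):
        batch, num_tokens, channels = patch_tokens.shape
        # [B, N, C] -> [B, C, H, W] so convolutions can treat tokens as pixels.
        fmap = patch_tokens.transpose(1, 2).reshape(
            batch, channels, self.grid, self.grid)
        return self.up(fmap), self.same(fmap), self.down(fmap)

# Random tokens stand in for encoder output with the class token removed.
tokens = torch.randn(2, 14 * 14, 768)
p3, p4, p5 = ViTToPyramid()(tokens)
```

All three maps are derived from the same final-layer tokens, so the fine scale carries no extra detail of its own. That is a known limitation of this simple design and one reason the adapter should be trained together with the neck and head.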

YOLO12 Training and Inference for All Model Sizes

YOLO12 detection weights are available in five sizes: yolo12n.pt, yolo12s.pt, yolo12m.pt, yolo12l.pt, and yolo12x.pt. Choose nano or small for class demos and limited GPUs, and medium, large, or extra-large for higher accuracy when memory and time allow.

n (Nano): fastest, lowest memory

s (Small): balanced classroom default

m (Medium): stronger accuracy

l (Large): slower, better features

x (Extra-large): most expensive option

Dataset YAML

path: /content/datasets/road_objects
train: images/train
val: images/val
test: images/test

names:
  0: person
  1: bicycle
  2: car
  3: bus

Install

pip install ultralytics

python -c "import ultralytics; ultralytics.checks()"

from ultralytics import YOLO

model_size = "s"  # choose from: "n", "s", "m", "l", "x"
model = YOLO(f"yolo12{model_size}.pt")

results = model.train(
    data="data.yaml",       # dataset file with train/val paths and class names
    epochs=100,             # full passes over the training set
    imgsz=640,              # input image size; larger can improve small objects
    batch=16,               # lower this if GPU memory is limited
    lr0=0.01,               # initial learning rate used by the YOLO optimizer
    weight_decay=0.0005,    # regularization for trainable weights
    device=0,               # GPU id; use "cpu" if no CUDA GPU is available
    project="cv_course_runs",
    name=f"yolo12{model_size}_objects"
)

from ultralytics import YOLO

model_size = "s"
model = YOLO(f"cv_course_runs/yolo12{model_size}_objects/weights/best.pt")

results = model.predict(
    source="test_images/",  # image, folder, video, webcam id, or URL
    imgsz=640,              # inference size; usually kept equal to the training size
    conf=0.25,              # confidence threshold; raise to reduce false positives
    iou=0.7,                # NMS IoU threshold; lower to suppress overlaps harder
    save=True               # writes annotated predictions to runs/detect/
)

for result in results:
    print(result.boxes.xyxy, result.boxes.conf, result.boxes.cls)
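
To make the `conf` and `iou` arguments above concrete, here is a simplified, class-agnostic sketch of confidence filtering followed by non-maximum suppression. It is an illustration in plain Python, not the Ultralytics implementation; the function names and the example boxes are made up, and boxes are assumed to have positive area.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_and_nms(boxes, scores, conf=0.25, iou=0.7):
    """Drop low-confidence boxes, then suppress boxes overlapping a better one."""
    candidates = [i for i, score in enumerate(scores) if score >= conf]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    for i in candidates:
        # Keep a box only if it does not overlap any kept box by more than `iou`.
        if all(box_iou(boxes[i], boxes[j]) <= iou for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260), (0, 0, 50, 50)]
scores = [0.90, 0.80, 0.60, 0.10]

kept_default = filter_and_nms(boxes, scores)           # near-duplicate removed
kept_loose = filter_and_nms(boxes, scores, iou=0.95)   # near-duplicate survives
```

The first two boxes overlap with an IoU of about 0.92, so the default threshold of 0.7 removes the lower-scoring one, while a threshold of 0.95 keeps both. The last box is dropped in both cases because its score is below `conf`.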

from ultralytics import YOLO

model_size = "s"
model = YOLO(f"cv_course_runs/yolo12{model_size}_objects/weights/best.pt")
metrics = model.val(
    data="data.yaml",
    imgsz=640,  # same image size used during training
    batch=16    # evaluation batch size; reduce if memory is low
)

print("mAP50-95:", metrics.box.map)
print("mAP50:", metrics.box.map50)
print("Precision:", metrics.box.mp)
print("Recall:", metrics.box.mr)

References for Students

DINOv2 paper

DINOv3 by Meta AI

BYOL paper

I-JEPA paper

MAE paper

Ultralytics YOLO12 docs