A single-page visual guide for understanding how modern self-supervised vision
backbones learn representations before they are transferred to detection tasks.
Unlabeled images -> SSL backbone -> Detector head -> Boxes + classes
What Students Should Notice
No labels first: SSL learns image features from pretext signals such as views, masks, teachers, or target embeddings.
Backbone matters: The pretrained encoder becomes a feature extractor for classification, segmentation, or object detection.
Detection still needs boxes: After SSL pretraining, a detector is fine-tuned with labeled bounding boxes and class labels.
DINOv2 trains robust visual features without supervision by scaling self-distillation,
curated data, and ViT backbones, then distilling strong teachers into smaller models.
Learns by: Matching student features to a teacher representation from augmented image views.
Detection transfer: Use the encoder as a pretrained backbone, then fine-tune a detection head on labeled boxes.
PyTorch focus: Teacher network, student network, centering/sharpening, and feature extraction.
Build the Detection Pipeline
Start with many class-relevant images. They do not need bounding boxes for SSL pretraining.
PyTorch Code Walkthrough
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vit_b_16

class ProjectionHead(nn.Module):
    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        # dim=768 is the ViT-B/16 feature width.
        # out_dim=256 is the compact embedding used by the SSL loss.
        # These projection layers are trainable during SSL pretraining.
        self.net = nn.Sequential(
            nn.Linear(dim, 2048),
            nn.GELU(),
            nn.Linear(2048, out_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class SSLBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # weights=None means the encoder starts from random weights.
        # Replace this with pretrained weights when fine-tuning a detector.
        self.encoder = vit_b_16(weights=None)
        # Remove the classifier because SSL needs representation vectors,
        # not ImageNet class logits.
        self.encoder.heads = nn.Identity()
        self.projector = ProjectionHead()

    def forward(self, images):
        features = self.encoder(images)
        return self.projector(features)

student = SSLBackbone()
teacher = SSLBackbone()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    # Teacher parameters are not trained by backpropagation.
    # They are updated by exponential moving average of the student.
    p.requires_grad = False

ssl_hyperparams = {
    "lr": 1e-4,            # learning rate for trainable student parameters
    "weight_decay": 0.04,  # regularizes large weights in ViT/projector
    "ema": 0.996,          # teacher update momentum; higher is smoother
    "temperature": 0.1     # sharpens similarity distributions
}
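The walkthrough above creates the teacher and lists the hyperparameters, but it stops before the update step itself. The following is a minimal sketch of one training step, assuming a DINO-style cross-entropy between a centered, sharpened teacher distribution and the student distribution. The names `ema_update` and `dino_style_loss`, the teacher temperature of 0.04, and the center momentum of 0.9 are illustrative assumptions, not values from this guide. Small linear layers and random tensors stand in for the ViT models and the augmented views so the step runs quickly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996) -> None:
    """Move each teacher parameter a small step toward the student parameter."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        # teacher = momentum * teacher + (1 - momentum) * student
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)


def dino_style_loss(student_out, teacher_out, center,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher and the student."""
    # Centering subtracts a running mean; the low teacher temperature sharpens.
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1)
    student_log_probs = F.log_softmax(student_out / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()


torch.manual_seed(0)

# Toy stand-ins for the student and teacher networks (assumption for speed).
toy_student = nn.Linear(8, 16)
toy_teacher = nn.Linear(8, 16)
toy_teacher.load_state_dict(toy_student.state_dict())
for p in toy_teacher.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(toy_student.parameters(), lr=1e-4, weight_decay=0.04)
center = torch.zeros(16)

# Random tensors stand in for two augmented views of the same images.
view_a = torch.randn(4, 8)
view_b = view_a + 0.1 * torch.randn(4, 8)

# The teacher sees one view without gradients; the student sees the other.
with torch.no_grad():
    teacher_out = toy_teacher(view_a)
loss = dino_style_loss(toy_student(view_b), teacher_out, center)

optimizer.zero_grad()
loss.backward()
optimizer.step()

# After the student step, update the teacher and the running center.
ema_update(toy_teacher, toy_student, momentum=0.996)
center = 0.9 * center + 0.1 * teacher_out.mean(dim=0)
```

Only the student receives gradients. The teacher changes solely through `ema_update`, which is why a momentum close to 1 makes it a slow, smoothed copy of the student.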

import torch
from PIL import Image
from torchvision import transforms

# SSLBackbone is the class defined in the pretraining snippet above.
ssl_model = SSLBackbone()
ssl_model.load_state_dict(torch.load("ssl_backbone.pt", map_location="cpu"))
ssl_model.eval()

preprocess = transforms.Compose([
    # The input size must match the resolution used by the SSL backbone.
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # Normalization keeps input scale consistent with common ViT training.
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# "example.jpg" is a placeholder path; substitute any RGB image.
pil_image = Image.open("example.jpg").convert("RGB")

with torch.no_grad():
    # no_grad disables gradient storage during inference feature extraction.
    image = preprocess(pil_image).unsqueeze(0)
    embedding = ssl_model.encoder(image)
    print(embedding.shape)  # [batch, feature_dim], for ViT-B often [1, 768]

# Conceptual transfer recipe:
# 1. Load the SSL encoder weights.
# 2. Use the encoder as the detector backbone.
# 3. Add a neck and detection head.
# 4. Fine-tune with labeled boxes.
# `detector` and `train_loader` are placeholders: a detection model that exposes
# a `.backbone` attribute and returns a dict of losses in training mode, and a
# DataLoader that yields (images, targets) batches.
detector.train()

for name, param in detector.backbone.named_parameters():
    # Freeze the SSL backbone first so the detector head learns stable boxes.
    # Trainable parameters at this stage: neck + detection head only.
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, detector.parameters()),
    lr=2e-4,           # head learning rate; lower if loss is unstable
    weight_decay=0.05  # helps prevent overfitting on small box datasets
)

for images, targets in train_loader:
    losses = detector(images, targets)
    loss = sum(losses.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Then unfreeze the SSL backbone with a smaller learning rate, for example
# backbone lr=2e-5 and head lr=2e-4, so pretrained features are not destroyed.
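
The second stage described in the closing comment can be expressed with optimizer parameter groups. This is a minimal sketch, assuming the detector splits cleanly into a backbone and a head. `ToyDetector` is a hypothetical stand-in built from linear layers, used only to show how the two learning rates are assigned.

```python
import torch
import torch.nn as nn


class ToyDetector(nn.Module):
    """Hypothetical stand-in with the same backbone/head split as a detector."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(32, 16)  # plays the role of the SSL encoder
        self.head = nn.Linear(16, 4)       # plays the role of neck + detection head


toy_detector = ToyDetector()

# Stage 1 state: the backbone is frozen while the head trains.
for param in toy_detector.backbone.parameters():
    param.requires_grad = False

# Stage 2: make the backbone trainable again.
for param in toy_detector.backbone.parameters():
    param.requires_grad = True

# One parameter group per learning rate; weight_decay applies to both groups.
optimizer = torch.optim.AdamW(
    [
        {"params": toy_detector.backbone.parameters(), "lr": 2e-5},
        {"params": toy_detector.head.parameters(), "lr": 2e-4},
    ],
    weight_decay=0.05,
)

for group in optimizer.param_groups:
    num_weights = sum(p.numel() for p in group["params"])
    print(f"lr={group['lr']}, parameters={num_weights}")
```

Building a new optimizer for stage 2 is the simpler choice here: an optimizer created while the backbone was frozen does not contain the backbone parameters at all.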
How an SSL Backbone Connects to a YOLO Detector
1. Replace or initialize (Backbone): The SSL encoder replaces the normal YOLO feature extractor, or initializes matching layers when the architectures are compatible.
2. Adapt feature maps (Neck): YOLO expects multi-scale feature maps. A small adapter or FPN/PAN neck converts SSL features into detection-friendly scales.
3. Predict objects (Head): The YOLO head remains supervised. It learns box coordinates, object confidence, and class probabilities from labeled annotations.
Input image -> SSL backbone features -> YOLO neck -> YOLO detection head -> Boxes, classes, confidence
In practice, YOLO backbones are usually CNN/CSP-style modules, while many SSL models are ViT-based. When shapes do not match directly, students should treat the SSL model as a feature extractor and add adapter layers before the YOLO neck. Freeze the SSL backbone first, train the neck/head, then unfreeze carefully with a smaller learning rate.
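One possible shape for such an adapter is sketched below, under stated assumptions. A ViT-B/16 at 224x224 input produces a 14x14 grid of 768-dimensional patch tokens, and the adapter turns that single-scale grid into three maps at strides 8, 16, and 32. `TokenToPyramidAdapter` and its layer choices are hypothetical, not part of YOLO or torchvision. Note that the encoder in the walkthrough above returns only the pooled class-token vector, so obtaining patch tokens needs access to the encoder's intermediate output, which is not shown here.

```python
import torch
import torch.nn as nn


class TokenToPyramidAdapter(nn.Module):
    """Hypothetical adapter: ViT patch tokens -> three maps at strides 8/16/32."""

    def __init__(self, embed_dim=768, out_channels=256, grid_size=14):
        super().__init__()
        self.grid_size = grid_size
        # Stride 8: upsample the token grid by 2 with a transposed convolution.
        self.up = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, out_channels, kernel_size=2, stride=2),
            nn.GELU(),
        )
        # Stride 16: keep the grid size and only change the channel count.
        self.same = nn.Conv2d(embed_dim, out_channels, kernel_size=1)
        # Stride 32: downsample by 2 with a strided convolution.
        self.down = nn.Conv2d(embed_dim, out_channels,
                              kernel_size=3, stride=2, padding=1)

    def forward(self, patch_tokens):
        # patch_tokens: [batch, num_patches, embed_dim], class token removed.
        batch, num_patches, dim = patch_tokens.shape
        assert num_patches == self.grid_size ** 2, "unexpected number of patch tokens"
        # Rearrange the token sequence into a 2D feature map.
        grid = patch_tokens.transpose(1, 2).reshape(
            batch, dim, self.grid_size, self.grid_size
        )
        return [self.up(grid), self.same(grid), self.down(grid)]


adapter = TokenToPyramidAdapter()
tokens = torch.randn(2, 14 * 14, 768)  # random stand-in for real patch tokens
pyramid = adapter(tokens)
for level in pyramid:
    print(tuple(level.shape))
```

The three outputs share a channel count, so a neck that expects multi-scale inputs can consume them once its input widths are set to match.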
YOLO12 Training and Inference for All Model Sizes
YOLO12 detection weights are available in five sizes: yolo12n.pt, yolo12s.pt, yolo12m.pt, yolo12l.pt, and yolo12x.pt. Choose nano or small for class demos and limited GPUs, and medium, large, or xlarge for higher accuracy when memory and time allow.
n (Nano): fastest, lowest memory
s (Small): balanced classroom default
m (Medium): stronger accuracy
l (Large): slower, better features
x (Xlarge): most expensive option
Dataset YAML
path: /content/datasets/road_objects
train: images/train
val: images/val
test: images/test
names:
  0: person
  1: bicycle
  2: car
  3: bus
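
The YAML above only points to image folders. The boxes themselves live in text label files, one per image, where each line holds a class index followed by a normalized center-x, center-y, width, and height. The sketch below converts one pixel-space box into that line format. The helper name `to_yolo_label` and the example numbers are illustrative assumptions.

```python
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box to a normalized YOLO label line."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A "car" (class 2 in the YAML above) in a 640x480 image.
line = to_yolo_label(2, 100, 200, 300, 400, img_w=640, img_h=480)
print(line)
```

Because the values are normalized by image size, the same label stays valid when the image is resized for training.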

from ultralytics import YOLO

model_size = "s"  # choose from: "n", "s", "m", "l", "x"
model = YOLO(f"yolo12{model_size}.pt")

results = model.train(
    data="data.yaml",     # dataset file with train/val paths and class names
    epochs=100,           # full passes over the training set
    imgsz=640,            # input image size; larger can improve small objects
    batch=16,             # lower this if GPU memory is limited
    lr0=0.01,             # initial learning rate used by the YOLO optimizer
    weight_decay=0.0005,  # regularization for trainable weights
    device=0,             # GPU id; use "cpu" if no CUDA GPU is available
    project="cv_course_runs",
    name=f"yolo12{model_size}_objects"
)

from ultralytics import YOLO

model_size = "s"
model = YOLO(f"cv_course_runs/yolo12{model_size}_objects/weights/best.pt")

results = model.predict(
    source="test_images/",  # image, folder, video, webcam id, or URL
    imgsz=640,              # must be compatible with training resolution
    conf=0.25,              # confidence threshold; raise to reduce false positives
    iou=0.7,                # NMS IoU threshold; lower to suppress overlaps harder
    save=True               # writes annotated predictions to runs/detect/
)

for result in results:
    print(result.boxes.xyxy, result.boxes.conf, result.boxes.cls)
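
To see what the `conf` and `iou` arguments do, it helps to run the filtering by hand on a few boxes. The following is a plain-Python sketch of confidence filtering followed by greedy non-maximum suppression. It ignores class labels for brevity and is not the library's implementation; the function names and the example boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(boxes, scores, conf=0.25, iou_thresh=0.7):
    """Return indices of boxes kept after thresholding and greedy NMS."""
    # Step 1: drop low-confidence boxes, then sort the rest by score.
    candidates = sorted(
        (i for i, score in enumerate(scores) if score >= conf),
        key=lambda i: scores[i],
        reverse=True,
    )
    # Step 2: keep a box only if it does not overlap a kept box too much.
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept


boxes = [
    (10, 10, 110, 110),    # strong detection
    (12, 12, 112, 112),    # near-duplicate of the first box
    (200, 200, 260, 260),  # separate object
    (10, 10, 60, 60),      # low-confidence box
]
scores = [0.90, 0.80, 0.60, 0.10]

print(filter_and_nms(boxes, scores, conf=0.25, iou_thresh=0.7))
print(filter_and_nms(boxes, scores, conf=0.25, iou_thresh=0.95))
```

With the default thresholds the near-duplicate is suppressed and the low-confidence box is dropped. Raising the IoU threshold to 0.95 lets the near-duplicate survive, which is why a lower `iou` value suppresses overlaps harder.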

from ultralytics import YOLO

model_size = "s"
model = YOLO(f"cv_course_runs/yolo12{model_size}_objects/weights/best.pt")

metrics = model.val(
    data="data.yaml",
    imgsz=640,  # same image size used during training
    batch=16    # evaluation batch size; reduce if memory is low
)

print("mAP50-95:", metrics.box.map)
print("mAP50:", metrics.box.map50)
print("Precision:", metrics.box.mp)
print("Recall:", metrics.box.mr)
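
The mAP numbers printed above summarize a precision-recall curve. As a rough illustration, the sketch below computes average precision for one class from detections that are already sorted by confidence and flagged as true or false positives at a fixed IoU threshold. The function name and the example flags are assumptions, and the library's exact interpolation may differ. mAP50 averages this quantity over classes at IoU 0.5, while mAP50-95 also averages over IoU thresholds from 0.50 to 0.95.

```python
def average_precision(is_true_positive, num_ground_truth):
    """AP for one class; detections must be sorted by descending confidence."""
    precisions, recalls = [], []
    tp = fp = 0
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Make precision non-increasing as recall grows (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangle areas under the stepped precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Five detections for one class, four ground-truth boxes in total.
flags = [True, True, False, True, False]
ap = average_precision(flags, num_ground_truth=4)
print(ap)
```

One ground-truth box is never detected in this example, so recall tops out at 0.75 and the average precision cannot reach 1.0 even though the first two detections are correct.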