CSE 438 Digital Image Processing

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel. Students should understand the full path from image tensors to dense class maps, along with the loss functions, evaluation metrics, and modern foundation models built around that task.

Example classes: road, car, sky, tree, person, building

Pipeline: Input image (H x W x 3) -> Encoder (features) -> Decoder (upsampling) -> Logits (C x H x W) -> Mask (argmax per pixel)
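The pipeline above can be sketched in a few lines of PyTorch. This is only an illustration under stated assumptions: `TinySegNet`, its layer sizes, and the 64 x 96 random input are hypothetical and are not a model from the course. The sketch shows the shape changes at each stage, including the move from H x W x 3 image layout to the channels-first layout that PyTorch convolutions expect.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySegNet(nn.Module):
    """Hypothetical toy encoder-decoder, used only to show tensor shapes."""

    def __init__(self, num_classes: int):
        super().__init__()
        # Encoder: two stride-2 convolutions reduce the resolution by 4x.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # 1x1 convolution: one score channel per semantic class.
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        height, width = x.shape[-2:]
        features = self.encoder(x)
        coarse_logits = self.classifier(features)
        # Decoder step: upsample the scores back to the input resolution.
        return F.interpolate(
            coarse_logits, size=(height, width),
            mode="bilinear", align_corners=False,
        )


num_classes = 6  # road, car, sky, tree, person, building
model = TinySegNet(num_classes).eval()

# An H x W x 3 image becomes a 1 x 3 x H x W batch (channels first).
image_hwc = torch.rand(64, 96, 3)
batch = image_hwc.permute(2, 0, 1).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)        # 1 x C x H x W class scores
mask = logits.argmax(dim=1)      # 1 x H x W integer class ids
```

With untrained weights the predicted mask is meaningless; the point is that `logits` carries one channel per class and `argmax` over that channel axis collapses it into a single class id per pixel.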
Pixel classifier: Every pixel receives a class score, so the output has one channel per semantic class.
Dense supervision: Training uses masks where each pixel value is a class id such as road, tree, car, or background.
Evaluation: Mean IoU compares predicted and ground-truth regions class by class, then averages the result.
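As a worked example of the evaluation card, the following is a minimal sketch of mean IoU on a single pair of masks. The function name `mean_iou` and the choice to skip classes that appear in neither mask are assumptions made for illustration; many benchmarks instead accumulate intersection and union counts over the whole dataset before dividing.

```python
import torch


def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int) -> float:
    """Average the per-class IoU, skipping classes absent from both masks."""
    ious = []
    for class_id in range(num_classes):
        pred_region = pred == class_id
        target_region = target == class_id
        union = (pred_region | target_region).sum().item()
        if union == 0:
            continue  # assumed convention: ignore classes with no pixels
        intersection = (pred_region & target_region).sum().item()
        ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Tiny 2 x 4 masks with class ids 0 and 1; class 2 never appears.
target = torch.tensor([[0, 0, 1, 1],
                       [0, 0, 1, 1]])
pred = torch.tensor([[0, 0, 0, 1],
                     [0, 0, 1, 1]])

score = mean_iou(pred, target, num_classes=3)
```

Here class 0 has intersection 4 and union 5 (IoU 0.8), class 1 has intersection 3 and union 4 (IoU 0.75), and class 2 is skipped, so `score` is 0.775.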
Promptable masks

SAM

Segment Anything introduced promptable segmentation with point, box, and mask prompts, trained on the large SA-1B mask dataset.

Images + video

SAM 2

SAM 2 extends promptable segmentation to images and videos with a streaming memory design for object masks across frames.

Open-vocabulary

SEEM

SEEM supports interactive segmentation from points, boxes, scribbles, masks, text, and referring expressions.

In-context segmentation

SegGPT

SegGPT frames segmentation as in-context visual prompting, learning to produce masks from example image-mask pairs.

Universal segmentation

OneFormer

OneFormer uses one transformer architecture for semantic, instance, and panoptic segmentation with task-conditioned training.

import torch
from torch.utils.data import Dataset
from PIL import Image
import numpy as np

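class SegmentationDataset(Dataset):
    """Pairs each RGB image with a mask of integer class ids."""

    def __init__(self, image_paths, mask_paths, image_transform=None):
        if len(image_paths) != len(mask_paths):
            raise ValueError("image_paths and mask_paths must have the same length")
        self.image_paths = image_paths
        self.mask_paths = mask_paths
        # Applied to the image only. Geometric transforms (crop, flip, resize)
        # would also have to be applied to the mask to keep pixels aligned.
        self.image_transform = image_transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        mask = Image.open(self.mask_paths[idx])

        # Mask pixels must be integer class ids:
        # 0=background, 1=road, 2=car, and so on.
        # Single-channel mask files give a tensor of shape (H, W).
        mask = torch.as_tensor(np.array(mask), dtype=torch.long)

        if self.image_transform:
            image = self.image_transform(image)
        else:
            # Fallback so default batching works: (3, H, W) floats in [0, 1].
            image = torch.as_tensor(np.array(image)).permute(2, 0, 1).float() / 255.0

        return image, mask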
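Masks in this integer format can be passed directly to a per-pixel cross-entropy loss. The sketch below uses random stand-in tensors rather than real data, and the values `num_classes = 3` and `ignore_index = 255` are assumptions chosen for illustration (255 is a common id for unlabeled pixels, but the course dataset may use something else).

```python
import torch
import torch.nn as nn

num_classes = 3      # assumed: 0=background, 1=road, 2=car
ignore_index = 255   # assumed id for unlabeled pixels

# Stand-ins for a model output and a batch of dataset masks.
logits = torch.randn(2, num_classes, 4, 5, requires_grad=True)  # N x C x H x W
masks = torch.randint(0, num_classes, (2, 4, 5))                # N x H x W, long
masks[0, 0, 0] = ignore_index                                   # one unlabeled pixel

# CrossEntropyLoss accepts N x C x H x W scores with N x H x W class ids
# and averages the loss over all pixels that are not ignored.
criterion = nn.CrossEntropyLoss(ignore_index=ignore_index)
loss = criterion(logits, masks)
loss.backward()
```

The loss takes raw logits, so no softmax is applied beforehand, and the ignored pixel contributes neither to the loss value nor to the gradient.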