An encoder-decoder architecture processes information in two stages: the encoder compresses input data into a compact internal representation, and the decoder expands that representation back into the desired output format. In image segmentation, the encoder is typically a classification backbone (ResNet, EfficientNet, ViT) that progressively reduces spatial resolution while extracting semantic features. The decoder then upsamples those features back to the original image resolution to produce per-pixel predictions.
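The two-stage flow can be sketched with a toy numpy example: the encoder here is a simple average-pooling stand-in for a real backbone, and the decoder is nearest-neighbor upsampling. The shapes, pooling factor, and function names are illustrative assumptions, not any particular model's implementation.

```python
import numpy as np

def encode(x, factor=4):
    """Toy encoder: average-pool the image by `factor` into a compact code."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def decode(z, factor=4):
    """Toy decoder: nearest-neighbor upsample back to the input resolution."""
    return np.repeat(np.repeat(z, factor, axis=0), factor, axis=1)

image = np.random.rand(32, 32)   # hypothetical 32x32 single-channel input
code = encode(image)             # compact 8x8 internal representation
output = decode(code)            # restored to 32x32 for per-pixel predictions
print(code.shape, output.shape)  # (8, 8) (32, 32)
```

A real segmentation model replaces both functions with learned convolutions, but the shape contract is the same: spatial resolution shrinks through the encoder and is recovered by the decoder.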
The key design challenge is preserving spatial detail through the bottleneck. U-Net solved this with skip connections that concatenate encoder features directly onto the decoder layer at the same spatial resolution, giving the decoder access to both high-level semantics and fine-grained spatial information. Feature Pyramid Networks (FPN) take a similar multi-scale approach for object detection, merging features from different encoder stages. Transformer-based decoders (used in DETR, Mask2Former, SAM) replace the upsampling path with cross-attention between learned queries and encoder features.
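The U-Net-style skip connection reduces to one tensor operation: upsample the coarse decoder features to the encoder stage's resolution, then concatenate along the channel axis. A minimal numpy sketch, with hypothetical channel counts and a (channels, height, width) layout chosen for illustration:

```python
import numpy as np

def upsample(x, factor=2):
    # Nearest-neighbor upsampling along the spatial axes (C, H, W layout).
    return np.repeat(np.repeat(x, factor, axis=1), factor, axis=2)

# Hypothetical feature maps: fine spatial detail from the encoder,
# coarse semantics from deeper in the decoder.
encoder_feat = np.random.rand(64, 32, 32)
decoder_feat = np.random.rand(128, 16, 16)

# U-Net skip connection: match resolutions, then concatenate channels.
merged = np.concatenate([encoder_feat, upsample(decoder_feat)], axis=0)
print(merged.shape)  # (192, 32, 32)
```

The merged tensor carries both information streams; in a real network a convolution would follow to fuse them before the next upsampling step.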
This pattern appears across computer vision: autoencoders for anomaly detection and representation learning, image-to-image translation (pix2pix, CycleGAN), super-resolution networks, depth estimation models, and video prediction architectures. The encoder can be swapped independently of the decoder, making it easy to upgrade backbones as better ones become available.
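The swappability comes from the narrow interface between the two halves: any encoder that maps an image to a coarse feature map can feed the same decoder. A small sketch, where both toy encoders are hypothetical stand-ins for interchangeable backbones:

```python
import numpy as np

def pool_encoder(x):   # stand-in for one backbone (e.g. a CNN)
    h, w = x.shape
    return x.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

def max_encoder(x):    # a drop-in replacement with the same interface
    h, w = x.shape
    return x.reshape(h // 4, 4, w // 4, 4).max(axis=(1, 3))

def decoder(z):        # unchanged regardless of which encoder is used
    return np.repeat(np.repeat(z, 4, axis=0), 4, axis=1)

image = np.random.rand(32, 32)
for encoder in (pool_encoder, max_encoder):  # swap backbones freely
    print(decoder(encoder(image)).shape)     # (32, 32) either way
```

As long as the replacement backbone produces features of a compatible shape (or a small adapter layer is added), the decoder and training pipeline need no changes.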


