Encoder-Decoder Architecture

An encoder-decoder architecture processes information in two stages: the encoder compresses input data into a compact internal representation, and the decoder expands that representation back into the desired output format. In image segmentation, the encoder is typically a classification backbone (ResNet, EfficientNet, ViT) that progressively reduces spatial resolution while extracting semantic features. The decoder then upsamples those features back to the original image resolution to produce per-pixel predictions.
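The two-stage pattern can be sketched as a minimal PyTorch module. This is an illustrative toy network (not any specific published model): the encoder halves spatial resolution twice while widening channels, and the decoder upsamples back to the input resolution to emit per-pixel class logits.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Toy encoder-decoder for segmentation (illustrative sketch only)."""

    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        # Encoder: strided convs reduce H x W to H/4 x W/4 while
        # extracting progressively more abstract features.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),  # H/2
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),     # H/4
        )
        # Decoder: transposed convs upsample back to full resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),       # H/2
            nn.ConvTranspose2d(16, num_classes, 2, stride=2),         # H
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(1, 3, 64, 64)
logits = TinySegNet()(x)
print(logits.shape)  # per-pixel class logits at the input resolution
```

In a real pipeline the `encoder` would be a pretrained backbone rather than two raw conv layers, but the shape contract is the same: downsample for semantics, upsample for per-pixel output.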

The key design challenge is preserving spatial detail through the bottleneck. U-Net solved this with skip connections that concatenate encoder features directly to the corresponding decoder layer, giving the decoder access to both high-level semantics and fine-grained spatial information. Feature Pyramid Networks (FPN) take a similar multi-scale approach for object detection, merging features from different encoder stages. Transformer-based decoders (used in DETR, Mask2Former, SAM) replace the upsampling path with cross-attention between learned queries and encoder features.
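A U-Net-style skip connection is simple to express in code. The sketch below (simplified relative to the original U-Net) upsamples coarse decoder features to the encoder's resolution, concatenates the two along the channel dimension, and fuses them with a convolution, so fine spatial detail bypasses the bottleneck:

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """One decoder stage with a U-Net-style skip connection (sketch)."""

    def __init__(self, enc_ch, dec_ch, out_ch):
        super().__init__()
        # 2x upsampling to match the encoder stage's resolution.
        self.up = nn.ConvTranspose2d(dec_ch, dec_ch, 2, stride=2)
        # Fuse concatenated encoder + decoder channels.
        self.conv = nn.Conv2d(enc_ch + dec_ch, out_ch, 3, padding=1)

    def forward(self, enc_feat, dec_feat):
        dec_up = self.up(dec_feat)                     # to encoder resolution
        merged = torch.cat([enc_feat, dec_up], dim=1)  # channel-wise concat
        return torch.relu(self.conv(merged))

enc = torch.randn(1, 16, 32, 32)  # fine-grained encoder features
dec = torch.randn(1, 32, 16, 16)  # coarse bottleneck features
out = SkipBlock(16, 32, 16)(enc, dec)
print(out.shape)
```

FPN-style merging differs mainly in using addition after a 1x1 channel projection instead of concatenation, but the multi-scale fusion idea is the same.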

This pattern appears across computer vision: autoencoders for anomaly detection and representation learning, image-to-image translation (pix2pix, CycleGAN), super-resolution networks, depth estimation models, and video prediction architectures. The encoder can be swapped independently of the decoder, making it easy to upgrade backbones as better ones become available.
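The swappability point follows from the decoder depending only on the encoder's output channel count. A hypothetical sketch (the function and encoders here are invented for illustration, not a real library API):

```python
import torch
import torch.nn as nn

def build_segmenter(encoder: nn.Module, enc_out_ch: int, num_classes: int = 2):
    # The decoder is parameterized only by the encoder's output channels,
    # so any backbone honoring that contract can be dropped in.
    decoder = nn.Sequential(
        nn.ConvTranspose2d(enc_out_ch, 16, 2, stride=2), nn.ReLU(),
        nn.ConvTranspose2d(16, num_classes, 2, stride=2),
    )
    return nn.Sequential(encoder, decoder)

# Two interchangeable toy encoders, both downsampling by 4x.
small = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                      nn.Conv2d(32, 32, 3, 2, 1), nn.ReLU())
wide = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
                     nn.Conv2d(64, 64, 3, 2, 1), nn.ReLU())

x = torch.randn(1, 3, 64, 64)
shapes = [build_segmenter(enc, ch)(x).shape for enc, ch in
          [(small, 32), (wide, 64)]]
print(shapes)  # identical output shapes regardless of backbone
```

Upgrading to a stronger pretrained backbone then means changing the encoder argument, not the decoder.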
