An encoder-decoder architecture processes information in two stages: the encoder compresses input data into a compact internal representation, and the decoder expands that representation back into the desired output format. In image segmentation, the encoder is typically a classification backbone (ResNet, EfficientNet, ViT) that progressively reduces spatial resolution while extracting semantic features. The decoder then upsamples those features back to the original image resolution to produce per-pixel predictions.
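The two-stage flow can be sketched with a toy numpy example: the encoder here is a simple average-pooling stand-in for a real backbone, and the decoder is nearest-neighbor upsampling. The shapes, pooling factor, and function names are illustrative assumptions, not any particular model's implementation.

```python
import numpy as np

def encode(x, factor=4):
    """Toy encoder: average-pool the image by `factor` into a compact code."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def decode(z, factor=4):
    """Toy decoder: nearest-neighbor upsample back to the input resolution."""
    return np.repeat(np.repeat(z, factor, axis=0), factor, axis=1)

image = np.random.rand(32, 32)   # hypothetical 32x32 single-channel input
code = encode(image)             # compact 8x8 internal representation
output = decode(code)            # restored to 32x32 for per-pixel predictions
print(code.shape, output.shape)  # (8, 8) (32, 32)
```

A real segmentation model replaces both functions with learned convolutions, but the shape contract is the same: spatial resolution shrinks through the encoder and is recovered by the decoder.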
The key design challenge is preserving spatial detail through the bottleneck. U-Net solved this with skip connections that concatenate encoder features directly onto the decoder layer at the same spatial resolution, giving the decoder access to both high-level semantics and fine-grained spatial information. Feature Pyramid Networks (FPN) take a similar multi-scale approach for object detection, merging features from different encoder stages. Transformer-based decoders (used in DETR, Mask2Former, SAM) replace the upsampling path with cross-attention between learned queries and encoder features.
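The U-Net-style skip connection reduces to one tensor operation: upsample the coarse decoder features to the encoder stage's resolution, then concatenate along the channel axis. A minimal numpy sketch, with hypothetical channel counts and a (channels, height, width) layout chosen for illustration:

```python
import numpy as np

def upsample(x, factor=2):
    # Nearest-neighbor upsampling along the spatial axes (C, H, W layout).
    return np.repeat(np.repeat(x, factor, axis=1), factor, axis=2)

# Hypothetical feature maps: fine spatial detail from the encoder,
# coarse semantics from deeper in the decoder.
encoder_feat = np.random.rand(64, 32, 32)
decoder_feat = np.random.rand(128, 16, 16)

# U-Net skip connection: match resolutions, then concatenate channels.
merged = np.concatenate([encoder_feat, upsample(decoder_feat)], axis=0)
print(merged.shape)  # (192, 32, 32)
```

The merged tensor carries both information streams; in a real network a convolution would follow to fuse them before the next upsampling step.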
This pattern appears across computer vision: autoencoders for anomaly detection and representation learning, image-to-image translation (pix2pix, CycleGAN), super-resolution networks, depth estimation models, and video prediction architectures. The encoder can be swapped independently of the decoder, making it easy to upgrade backbones as better ones become available.
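The swappability comes from the narrow interface between the two halves: any encoder that maps an image to a coarse feature map can feed the same decoder. A small sketch, where both toy encoders are hypothetical stand-ins for interchangeable backbones:

```python
import numpy as np

def pool_encoder(x):   # stand-in for one backbone (e.g. a CNN)
    h, w = x.shape
    return x.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

def max_encoder(x):    # a drop-in replacement with the same interface
    h, w = x.shape
    return x.reshape(h // 4, 4, w // 4, 4).max(axis=(1, 3))

def decoder(z):        # unchanged regardless of which encoder is used
    return np.repeat(np.repeat(z, 4, axis=0), 4, axis=1)

image = np.random.rand(32, 32)
for encoder in (pool_encoder, max_encoder):  # swap backbones freely
    print(decoder(encoder(image)).shape)     # (32, 32) either way
```

As long as the replacement backbone produces features of a compatible shape (or a small adapter layer is added), the decoder and training pipeline need no changes.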


