Convolutional neural networks (CNN)
A convolutional neural network (CNN) is a type of deep learning architecture specifically designed for processing grid-structured data like images. Its defining operation is convolution: sliding small learnable filters (typically 3x3 or 5x5 pixels) across the input image to detect local patterns like edges, corners, and textures. By stacking many convolutional layers, the network learns to recognize increasingly complex features, from simple edges in early layers to object parts and full objects in deeper layers.
A typical CNN architecture consists of convolutional layers (feature extraction), pooling layers (spatial downsampling to reduce computation), and fully connected layers (classification). Modern CNNs add batch normalization (stabilizes training), residual/skip connections (enables very deep networks like ResNet), and squeeze-and-excitation blocks (channel attention). Popular architectures include VGG, ResNet, EfficientNet, ConvNeXt, and the convolutional backbones used in YOLO detectors.
CNNs dominated computer vision from 2012 (AlexNet) through 2020 and remain widely used, especially for real-time and edge deployment where their efficient local computation is an advantage over transformer architectures. Most object detection and segmentation models still use CNN backbones (ResNet, CSPDarknet, EfficientNet) as feature extractors, even when paired with transformer-based detection heads.


