Foundation Models
A foundation model is a large neural network pre-trained on a massive, broad dataset that serves as a general-purpose starting point for many downstream tasks. Instead of training a model from scratch for each application, you take a foundation model that has already learned general visual concepts and adapt it to your domain through fine-tuning, prompting, or feature extraction.
In computer vision, foundation models include large pre-trained backbones like DINOv2 (self-supervised on 142 million images), CLIP (trained on 400 million image-text pairs for zero-shot recognition), SAM (Segment Anything Model, trained on 1 billion masks for universal segmentation), and Florence (Microsoft's multi-task vision model). These models capture rich, transferable visual representations that generalize well across diverse tasks and domains without task-specific training.
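The zero-shot recognition that CLIP enables reduces, at inference time, to a similarity search: embed the image and a text prompt per class into the same space, then pick the class whose prompt embedding is closest. The sketch below shows just that final step with toy numpy vectors standing in for real CLIP embeddings; the embeddings, prompts, and dimensionality are illustrative assumptions, not actual model outputs.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Return the index of the class prompt most similar (cosine) to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarity of the image against each class prompt
    return int(np.argmax(sims)), sims

# Toy stand-ins for CLIP embeddings (real ones are 512+ dimensional).
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],  # embedding of e.g. "a photo of a cat"
    [0.0, 1.0, 0.0],  # embedding of e.g. "a photo of a dog"
])

pred, sims = zero_shot_classify(image_emb, text_embs)
print(pred)  # 0 -> the first ("cat") prompt wins
```

Because no classifier is trained, changing the set of recognizable classes is just a matter of embedding a different list of prompts.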
The practical impact is significant: a foundation model pre-trained on web-scale data encodes knowledge about edges, textures, objects, spatial relationships, and even semantic concepts that would take prohibitive amounts of domain-specific data to learn from scratch. Fine-tuning a foundation model on a few hundred labeled images in your target domain often outperforms training a smaller model on thousands of images from scratch.
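The cheapest version of this adaptation is a linear probe: run the frozen foundation model once to cache features for your labeled images, then train only a small linear head on top. The sketch below illustrates the idea with synthetic random features in place of a real backbone's outputs; the data, dimensions, learning rate, and step count are all assumptions chosen so the toy problem converges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen backbone features of a few hundred labeled images.
# In practice you would run the foundation model once and cache its outputs.
n, d = 200, 16
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)  # synthetic binary labels

# Linear probe: a logistic-regression head trained on the frozen features.
w = np.zeros(d)
lr = 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # sigmoid predictions
    w -= lr * X.T @ (p - y) / n         # gradient step on the log-loss

acc = np.mean(((X @ w) > 0) == (y == 1))
print(f"train accuracy: {acc:.2f}")
```

Only the d-dimensional weight vector is learned, which is why a few hundred labeled examples can suffice: the hard representational work was already done during pre-training.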


