Knowledge Distillation

Knowledge distillation transfers the learned behavior of a large, accurate "teacher" model into a smaller, faster "student" model. The student is trained not just on the ground truth labels, but also on the teacher's output probabilities ("soft labels"). These soft labels carry richer information than hard labels: when a teacher says an image is 70% cat, 20% dog, and 10% fox, it encodes visual similarity relationships that a hard "cat" label doesn't capture.

The original method (Hinton et al., 2015) divides the logits by a temperature parameter T before the softmax. Setting T above 1 softens the teacher's output distribution, making the relative probabilities between non-target classes more visible. The student's loss combines a standard cross-entropy term (against ground truth) with a KL divergence term (against the teacher's softened outputs, with the student's outputs softened at the same temperature). Feature-based distillation goes deeper: instead of just matching output probabilities, the student learns to reproduce the teacher's intermediate feature representations at specific layers. Relation-based distillation preserves the similarity structure between samples rather than individual outputs.
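The combined loss described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single example, not a reference implementation: the function names, the weighting factor `alpha`, and the default `temperature=4.0` are assumptions chosen for readability. The soft term is multiplied by T squared, following the suggestion in Hinton et al. to keep its gradient scale comparable to the hard term as the temperature changes.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Hard-label loss: negative log-probability of the ground-truth class."""
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of the hard-label term and the softened teacher-matching term.

    alpha and temperature are illustrative defaults (assumptions), not tuned values.
    """
    hard = cross_entropy(softmax(student_logits), target_index)
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return alpha * hard + (1.0 - alpha) * (temperature ** 2) * soft


# Hypothetical logits for the classes (cat, dog, fox); ground truth is "cat".
teacher_logits = [4.0, 2.0, 1.0]
student_logits = [2.5, 2.0, 0.5]
loss = distillation_loss(student_logits, teacher_logits, target_index=0)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the same computation would run over batches inside an autodiff framework, with the teacher's logits computed without gradient tracking so that only the student is updated.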

Knowledge distillation is standard practice for deploying vision models on edge hardware. A large ResNet-152 or ViT-Large teacher is trained on a server, and its knowledge is then distilled into a MobileNet or EfficientNet-Lite student that runs on phones or embedded devices. Some YOLO variants use distillation during training to boost the accuracy of their smaller models. Depending on the task and the size gap between the two models, the student typically recovers 95-99% of the teacher's accuracy while running inference 5-20x faster.
