Quantization
Quantization reduces the numerical precision of a model's weights and activations, typically from 32-bit floating point (FP32) to 16-bit (FP16), 8-bit integer (INT8), or even 4-bit (INT4). This makes the model smaller (up to 4x for INT8), faster (integer operations are cheaper than floating point on most hardware), and more memory-efficient, which is critical for deploying on edge devices and reducing cloud inference costs.
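To make the precision reduction concrete, here is a minimal sketch of affine INT8 quantization in pure Python. The function names (`quantize_int8`, `dequantize`) and the asymmetric min/max scheme are illustrative choices, not any particular framework's API; real toolchains add refinements such as per-channel scales.

```python
def quantize_int8(values):
    # Affine (asymmetric) quantization: map the observed float range
    # [lo, hi] onto the INT8 range [-128, 127].
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0
    zero_point = round(-128 - lo / scale)  # float value `lo` maps to -128
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover approximate float values; error is bounded by ~scale/2.
    return [(qi - zero_point) * scale for qi in q]

q, scale, zp = quantize_int8([-1.0, 0.0, 2.0])
print(q, dequantize(q, scale, zp))
```

Each weight now occupies one byte instead of four, which is where the up-to-4x size reduction comes from; the price is a small, bounded rounding error per value.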
Two main approaches exist. Post-training quantization (PTQ) converts a trained FP32 model to lower precision after the fact, using a small calibration dataset to determine per-layer scaling factors. It is fast and requires no retraining, but it can cost accuracy, especially at INT4. Quantization-aware training (QAT) simulates quantization during training by inserting fake-quantize operations into the forward pass, allowing the model to learn to compensate for the reduced precision. QAT typically preserves more accuracy than PTQ but requires access to the training pipeline and additional training time.
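The two approaches can be sketched side by side. Assuming a symmetric per-tensor scheme for simplicity, PTQ calibration reduces to finding a scale from observed activations, while QAT's core building block is a "fake-quantize" op that rounds to the integer grid but stays in float so gradients can flow. Both function names here are hypothetical:

```python
def calibrate_scale(calibration_batches):
    # PTQ-style calibration: track the largest absolute activation seen
    # across the calibration set, then derive a symmetric INT8 scale.
    max_abs = 0.0
    for batch in calibration_batches:
        max_abs = max(max_abs, max(abs(x) for x in batch))
    return max_abs / 127.0  # so max_abs maps to the INT8 extreme

def fake_quantize(x, scale):
    # QAT-style fake quant: snap to the INT8 grid but return a float,
    # so training sees the rounding error and learns to compensate.
    q = max(-127, min(127, round(x / scale)))
    return q * scale

scale = calibrate_scale([[0.5, -2.0], [1.0]])
print(scale, fake_quantize(0.5, scale))
```

The key difference is when the rounding error appears: PTQ applies it once after training, while QAT exposes it to the loss on every forward pass.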
In practice, INT8 quantization with TensorRT or TFLite delivers a 2-4x speedup over FP32 with less than 1% accuracy drop on most vision models (YOLO, EfficientNet, ResNet). FP16 is essentially free on modern GPUs: half the memory footprint of FP32, negligible accuracy loss, and often faster thanks to Tensor Core support. INT4 and lower precisions are mainly used for large language models and remain experimental for vision tasks. Datature's deployment pipeline supports automatic quantization for edge targets.
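The memory side of these trade-offs is simple arithmetic over bytes per parameter. The 25M parameter count below is illustrative (roughly ResNet-50 scale), and the figures cover weight storage only, not runtime activations or framework overhead:

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def model_size_mb(num_params, precision):
    # Weight storage only; activations and runtime overhead are extra.
    return num_params * BYTES_PER_PARAM[precision] / 1e6

n = 25_000_000  # illustrative parameter count, roughly ResNet-50 scale
for p in ("FP32", "FP16", "INT8", "INT4"):
    print(f"{p}: {model_size_mb(n, p):.1f} MB")
# FP32: 100.0 MB, FP16: 50.0 MB, INT8: 25.0 MB, INT4: 12.5 MB
```

This is why FP16 halves memory and INT8 yields the 4x reduction cited above; whether the matching speedup materializes depends on the hardware's integer and Tensor Core throughput.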

