The F1 score is the harmonic mean of precision and recall, providing a single number that balances both metrics. It's calculated as 2 * (precision * recall) / (precision + recall), and ranges from 0 to 1. Unlike a simple average, the harmonic mean penalizes extreme imbalances: if precision is 0.95 but recall is 0.10, the F1 score is 0.18 rather than the 0.53 an arithmetic mean would suggest, correctly reflecting that the model is missing most positive cases.
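A minimal sketch of the calculation, reproducing the 0.95/0.10 example above (the `f1` helper is just an illustration, not a library API):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# High precision but very low recall: the harmonic mean stays low.
p, r = 0.95, 0.10
print(round(f1(p, r), 2))     # harmonic mean  -> 0.18
print(round((p + r) / 2, 2))  # arithmetic mean -> 0.53 (misleadingly high)
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot buy a good F1 by excelling at only one metric.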
F1 is most useful when you care equally about false positives and false negatives, and when the class distribution is imbalanced (making accuracy misleading). A defect detection system that catches 95% of defects but also flags 30% of good parts as defective would have a mediocre F1, surfacing the precision problem that accuracy alone might hide.
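The defect-detection scenario can be worked through with concrete counts. The part mix below (100 defective, 1,000 good parts) is an assumption chosen for illustration; the 95% catch rate and 30% false-flag rate come from the text:

```python
# Assumed mix: 100 defective parts, 1,000 good parts.
tp = 95    # defects caught (95% of 100)
fn = 5     # defects missed
fp = 300   # good parts wrongly flagged (30% of 1,000)
tn = 700   # good parts correctly passed

precision = tp / (tp + fp)                  # ~0.24: most flags are false alarms
recall = tp / (tp + fn)                     # 0.95
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

Despite the impressive 95% recall, F1 lands well below accuracy because the poor precision drags the harmonic mean down.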
Variants handle multi-class scenarios: macro F1 computes F1 per class then averages (treats all classes equally), micro F1 pools true positives, false positives, and false negatives across all classes (biased toward frequent classes), and weighted F1 scales each class's F1 by its support count. In object detection, F1 is computed at specific confidence thresholds, and the threshold that maximizes F1 is often reported alongside precision-recall curves and mAP.
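The three averaging variants can be sketched from scratch on a toy three-class problem (the labels below are made up for illustration; the `2*TP / (2*TP + FP + FN)` form is algebraically identical to `2PR/(P+R)`):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, treating it as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if 2 * tp + fp + fn == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)  # same as 2PR/(P+R)

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 0]
classes = sorted(set(y_true))
support = Counter(y_true)

f1s = {c: per_class_f1(y_true, y_pred, c) for c in classes}
macro = sum(f1s.values()) / len(classes)
weighted = sum(support[c] * f1s[c] for c in classes) / len(y_true)

# Micro F1 pools TP/FP/FN over all classes; in single-label multi-class
# classification every error is one FP and one FN, so it equals accuracy.
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"macro={macro:.3f} micro={micro:.3f} weighted={weighted:.3f}")
```

Note how the three averages disagree even on this tiny example: macro treats the rare classes as equals, while weighted leans toward the well-supported class 0.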


