Optical Character Recognition (OCR) converts images of printed, handwritten, or scene text into machine-readable character strings. Traditional OCR follows a detect-then-recognize approach: a text detection model (EAST, CRAFT, DBNet) finds text regions as bounding boxes or polygons, then a recognition model (CRNN, ASTER, TrOCR) reads each cropped region character by character or as whole words.
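The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: `TextRegion`, `crop`, and `run_ocr` are hypothetical names, the image is assumed to be a plain list of pixel rows, and the detector and recognizer are passed in as callables that stand in for models such as DBNet or CRNN.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values (assumption)
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


@dataclass
class TextRegion:
    box: Box
    text: str


def crop(image: Image, box: Box) -> Image:
    """Cut an axis-aligned box out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def run_ocr(
    image: Image,
    detect: Callable[[Image], Sequence[Box]],
    recognize: Callable[[Image], str],
) -> List[TextRegion]:
    """Stage 1 finds boxes; stage 2 reads each cropped box independently."""
    regions = []
    # Sort top-to-bottom, then left-to-right, as a rough reading order.
    for box in sorted(detect(image), key=lambda b: (b[1], b[0])):
        regions.append(TextRegion(box, recognize(crop(image, box))))
    return regions
```

Keeping the two stages behind separate callables mirrors how the stages are usually swapped independently, for example replacing only the recognizer when a new script must be supported.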
Modern toolkits and services such as PaddleOCR, EasyOCR, and Google Cloud Vision bundle both stages behind a single interface. Transformer-based models (TrOCR, Donut, GOT-OCR) treat text recognition as a sequence-to-sequence problem: an encoder embeds the image and a decoder emits text tokens, which can let one model handle curved text, rotated text, and multiple languages in a single pass. Document AI extends basic OCR with layout analysis, detecting tables, headers, paragraphs, and key-value pairs to extract structured data from invoices, forms, receipts, and technical drawings.
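As a rough illustration of the layout-analysis step, the sketch below groups recognized words into text lines by vertical position and then pulls out "Key: value" pairs. It is a simplified heuristic under stated assumptions, not how any specific Document AI product works: the `Word` fields, the `tolerance` parameter, and both function names are hypothetical, and real systems typically use learned layout models rather than fixed rules.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    text: str
    x: float       # left edge of the word box
    y: float       # vertical centre of the word box
    height: float  # box height, used to scale the line tolerance


def group_into_lines(words: List[Word], tolerance: float = 0.5) -> List[List[Word]]:
    """Put a word on the current line if its centre is close to the previous word's."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        if lines and abs(word.y - lines[-1][-1].y) <= tolerance * word.height:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, restore left-to-right order.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def extract_key_values(lines: List[List[Word]]) -> Dict[str, str]:
    """Treat any line of the form 'Key: value' as a key-value pair."""
    pairs: Dict[str, str] = {}
    for line in lines:
        text = " ".join(w.text for w in line)
        key, sep, value = text.partition(":")
        if sep and key.strip() and value.strip():
            pairs[key.strip()] = value.strip()
    return pairs
```

The same grouping idea extends to tables by clustering on `x` as well as `y`, although skewed scans usually need deskewing first.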
In manufacturing and logistics, OCR powers fallback reading of human-readable text when a barcode cannot be scanned, serial number tracking, label verification, and compliance documentation. The main challenges are low-contrast text, perspective distortion, degraded print quality, and mixed-script environments where multiple languages appear in the same document. Recent VLM-based approaches (using models like Qwen-VL or GPT-4V) can read and reason about text in images without dedicated OCR pipelines, blurring the line between OCR and general visual understanding.
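For serial number tracking and label verification, degraded print often produces look-alike substitutions such as `O` for `0` or `S` for `5`. One common mitigation is to post-process the OCR output against the known format. The sketch below assumes a made-up serial format (three letters, a dash, six digits) and a small, illustrative confusion table; the format, the table, and the function names are all hypothetical and would need to match the actual labels in use.

```python
import re
from typing import Optional

# Hypothetical format: three letters, a dash, six digits (e.g. "ABC-123456").
SERIAL_PATTERN = re.compile(r"[A-Z]{3}-\d{6}")

# Illustrative look-alike corrections, applied by position in the format.
TO_DIGIT = str.maketrans({"O": "0", "Q": "0", "I": "1", "L": "1", "S": "5", "B": "8", "Z": "2"})
TO_LETTER = str.maketrans({"0": "O", "1": "I", "5": "S", "8": "B", "2": "Z"})


def normalize_serial(raw: str) -> Optional[str]:
    """Return a corrected serial, or None if the text cannot fit the format."""
    text = raw.strip().upper().replace(" ", "")
    prefix, sep, digits = text.partition("-")
    if not sep:
        return None
    # Letters are expected before the dash and digits after it.
    candidate = prefix.translate(TO_LETTER) + "-" + digits.translate(TO_DIGIT)
    return candidate if SERIAL_PATTERN.fullmatch(candidate) else None


def verify_label(raw: str, expected: str) -> bool:
    """Check an OCR reading against the serial the label should carry."""
    return normalize_serial(raw) == expected
```

A reading that fails normalization is better routed to manual review than silently accepted, since position-based correction can also mask a genuinely wrong label.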
