Resources

Glossary

Definitions of AI terms, platform features, and machine learning concepts.

Action Recognition

Action recognition is the process of identifying and categorizing human actions or movements in videos or images, such as walking, running, or dancing, to enable computer systems to understand and respond to these actions automatically.

Activation Function

A mathematical function applied after each neural network layer that introduces non-linearity, allowing the model to learn complex patterns like edges, textures, and shapes rather than only simple linear relationships.
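
As an illustrative sketch (not tied to any particular framework), two widely used activation functions can be written in plain Python:

```python
import math

def relu(x):
    # Rectified Linear Unit: passes positives through, zeroes out negatives
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-2.3))              # 0.0
print(relu(1.5))               # 1.5
print(round(sigmoid(0.0), 2))  # 0.5
```

ReLU is the common default for hidden layers; sigmoid is typically reserved for binary outputs.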

Active learning

Active learning is a training approach where the model doesn’t just passively consume labeled data - it actively chooses which unlabeled examples would be most valuable to label next. By prioritizing samples the model is uncertain about or that best cover the data space, you can reach a target accuracy with far fewer labeled examples (and lower labeling cost) than random labeling.

Agentic Vision

Agentic vision refers to AI systems that autonomously perceive visual information, plan multi-step actions, and execute tasks in visual environments, combining VLMs with tool use, memory, and decision-making capabilities.

Anchor Box

An anchor box is one of a set of predefined bounding box templates of various sizes and aspect ratios that object detection models use as reference shapes when predicting object locations.

Anchor-Free Detection

An object detection approach that predicts bounding boxes directly from feature map locations, removing the need for predefined anchor box templates and the manual tuning they require.

Annotation

Annotation is the process of labeling your data to teach your deep learning model the outcome you want to predict. Generally, bounding boxes are used to train for object detection and polygons are used to train for instance segmentation.

Annotation Format

Annotation format is the specific method used to encode annotations, describing each bounding box's size and position (COCO, YOLO, TXT, etc.).
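
For example, the COCO and YOLO formats describe the same box differently: COCO stores absolute top-left coordinates with width and height, while YOLO stores center coordinates normalized by image size. A minimal conversion sketch:

```python
def coco_to_yolo(box, img_w, img_h):
    # COCO: [x_min, y_min, width, height] in absolute pixels
    # YOLO: [x_center, y_center, width, height] normalized to [0, 1]
    x, y, w, h = box
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]

print(coco_to_yolo([50, 100, 200, 100], img_w=400, img_h=400))
# [0.375, 0.375, 0.5, 0.25]
```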

Anomaly Detection

The task of identifying patterns in images that deviate from what is expected, such as defective products on a manufacturing line, unusual structures in medical scans, or damaged infrastructure in inspection photos.

Application Programming Interface (API)

An application programming interface is a mechanism that lets software components communicate with other software, databases, or applications. Companies can use APIs to support digital transformation or build an ecosystem. We use a REST API to allow users to easily import their models into our platform.

Attention Mechanism

A technique that lets a neural network focus on the most relevant parts of its input by computing weighted combinations where important regions receive higher influence on the final prediction.
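
A minimal sketch of scaled dot-product attention for a single query, in plain Python (the list-based vectors here are illustrative, not a production implementation):

```python
import math

def attention(query, keys, values):
    # Score each key against the query, scaled by sqrt of the dimension
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax turns scores into weights that sum to 1
    exp = [math.exp(s - max(scores)) for s in scores]
    weights = [e / sum(exp) for e in exp]
    # Output is the weighted combination of value vectors
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

out = attention(query=[1.0, 0.0], keys=[[1.0, 0.0], [0.0, 1.0]], values=[[10.0], [20.0]])
# the first value dominates because the query matches the first key better
```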

Attribute/attribute group

An attribute is an item of data used in machine learning, and attribute groups define clusters of attributes that capture additional information about a product.

Augmentation

Augmentation improves dataset robustness. It allows users to expand their existing dataset through positional or color-space transformations. These techniques prevent the model from leaning on specific features while training.

Automated Machine Learning (AutoML)

AutoML automates the tasks involved in building and optimizing machine learning models for real-world applications. It covers the whole process from loading a raw dataset to deploying the ML model.

Backpropagation

The algorithm neural networks use to learn — it calculates how much each weight contributed to prediction errors and adjusts them accordingly, repeating this process over every batch of training data.

Batch Normalization

A training technique that stabilizes learning by normalizing each layer's inputs to have zero mean and unit variance, allowing higher learning rates and reducing sensitivity to weight initialization.
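
The core normalization step can be sketched in plain Python (omitting the learnable scale and shift parameters a real batch norm layer also applies):

```python
import math

def batch_norm(values, eps=1e-5):
    # Normalize a batch of activations to zero mean and unit variance;
    # eps guards against division by zero when the variance is tiny
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

normed = batch_norm([2.0, 4.0, 6.0, 8.0])
# normed has mean ~0 and variance ~1
```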

Bounding box

A bounding box is a rectangular region of an image that encloses an object and is described by its (x, y) coordinates.

CLIP (Contrastive Language-Image Pre-training)

CLIP is an OpenAI model that learns to connect images and text by training on 400 million image-text pairs from the internet, enabling zero-shot image classification and powering the vision encoders inside most modern VLMs.

COCO

COCO (Common Objects in Context) is a large-scale image dataset whose annotations are stored in JSON format, widely used to benchmark model performance on common object detection problems.

Chain-of-Thought Reasoning

A prompting technique where a model works through a problem step by step before producing a final answer, improving accuracy on tasks that require multi-step logic or spatial reasoning.

Class Imbalance

A dataset condition where some classes have far more examples than others, causing models to favor the majority class and miss the rare cases that actually matter unless corrected with resampling or modified loss functions.

Classification

Classification is a machine learning task where data is categorized into predefined classes or labels. The goal is to build a model that can predict the correct label for new, unseen data based on patterns and features learned from a training dataset. It's widely used in various applications, such as spam detection or image recognition.

Clustering

Clustering is an unsupervised technique that groups data points by similarity without requiring labels.

Computer Vision

Computer Vision is the science of enabling computers to see and understand images and video. This is accomplished by developing algorithms that can make sense of visual content, for example detecting people or objects in an image or video, or being able to read road signs.

Confusion Matrix

A confusion matrix is a table used in machine learning to evaluate the performance of a classification model. It summarizes the model's predictions by showing the true positive, true negative, false positive, and false negative counts, enabling the assessment of accuracy, precision, recall, and other metrics.
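
For a binary classifier, the four counts can be computed directly (a minimal sketch; the labels here are illustrative):

```python
def confusion_matrix(y_true, y_pred):
    # Counts for a binary classifier: (TP, FP, FN, TN)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
# tp=2, fp=1, fn=1, tn=1 -> accuracy = (tp + tn) / total = 3 / 5 = 0.6
```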

Contrastive Learning

A self-supervised training method where a model learns to produce similar representations for related images and different representations for unrelated ones, without needing class labels.

Convolutional neural networks (CNN)

A CNN is a neural network with at least one convolutional layer, typically used for image recognition and classification.

Cross-Attention

Cross-attention is a transformer mechanism where one input sequence (such as text) attends to another (such as image patches), enabling vision-language models to fuse visual and textual information for tasks like captioning and grounding.

DETR (Detection Transformer)

An object detection architecture from Meta AI that uses learned object queries and a transformer decoder to directly predict detections, removing the need for anchor boxes and non-maximum suppression.

Data Labeling

The process of adding structured annotations to raw images — such as bounding boxes, segmentation masks, or class tags — so that models can learn from them. Label quality directly sets the ceiling for model performance.

Data Preprocessing

The steps applied to raw images and annotations before training — resizing, normalizing pixel values, converting color spaces, and transforming labels to match the model's expected input format.

Dataset Splitting

Dividing a labeled dataset into training, validation, and test subsets so the model can learn from one portion, tune hyperparameters on another, and be evaluated fairly on data it has never seen.
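
A common 70/15/15 split can be sketched as follows (the ratios and seed are illustrative choices):

```python
import random

def split_dataset(items, train=0.7, val=0.15, seed=42):
    # Shuffle once with a fixed seed for reproducibility, then slice
    rng = random.Random(seed)
    items = items[:]
    rng.shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train_set, val_set, test_set = split_dataset(list(range(100)))
# 70 / 15 / 15 items, with no overlap between subsets
```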

Deep Learning

A branch of machine learning that uses neural networks with many layers to automatically learn hierarchical features from data — raw pixels become edges, edges become textures, textures become parts, and parts become recognizable objects.

Depth Estimation

A computer vision task that predicts how far each pixel in an image is from the camera, producing a depth map used in autonomous driving, robotics, augmented reality, and 3D scene reconstruction.

Diffusion Models

A class of generative models that create images by learning to reverse a gradual noising process — starting from random static and iteratively refining it into a coherent image, as used in Stable Diffusion and DALL-E.

Domain Adaptation

Techniques for closing the gap between training data and real-world deployment data when they come from different conditions, such as a model trained on daytime images struggling when used at night.

Dropout

A regularization technique that randomly deactivates a fraction of neurons during each training step, forcing the network to spread learned features more evenly and reducing overfitting on small datasets.
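
A sketch of inverted dropout, the variant most frameworks use, where surviving activations are scaled up during training so no rescaling is needed at inference:

```python
import random

def dropout(values, p=0.5, training=True, seed=0):
    # At inference time dropout is disabled and inputs pass through unchanged
    if not training:
        return values[:]
    rng = random.Random(seed)
    # Drop each activation with probability p; scale survivors by 1/(1-p)
    return [0.0 if rng.random() < p else v / (1 - p) for v in values]

print(dropout([1.0, 2.0], training=False))  # [1.0, 2.0] — unchanged at inference
```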

Edge AI

Running trained models directly on local hardware — cameras, phones, industrial controllers — instead of sending data to cloud servers, enabling low-latency decisions, data privacy, and offline operation.

Encoder-Decoder Architecture

A neural network design where an encoder compresses input into a compact representation and a decoder expands it back to the desired output, commonly used in image segmentation with skip connections to preserve spatial detail.

Epoch

One complete pass through the entire training dataset. Most models train for multiple epochs, seeing each image many times, because a single pass rarely extracts all learnable patterns from the data.

Explainable AI

Methods and tools that show which parts of an image a model focused on when making a prediction, using techniques like Grad-CAM heatmaps or SHAP scores to make otherwise opaque decisions interpretable.

F1 Score

The harmonic mean of precision and recall, providing a single number that balances both metrics. Unlike a simple average, it penalizes extreme imbalances — high precision with low recall still yields a low F1.
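
The penalty for imbalance is easy to see numerically (a minimal sketch from raw counts):

```python
def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=8, fp=2, fn=8))
# precision = 0.8, recall = 0.5 -> F1 ~ 0.615, below the simple average of 0.65
```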

Feature Pyramid Network (FPN)

A multi-scale feature extraction architecture that combines deep semantic features with shallow spatial detail through a top-down pathway, helping detection models handle objects at all sizes.

Few-Shot Learning

Few-shot learning is a machine learning approach where a model learns to recognize new categories or perform new tasks from just a handful of labeled examples, typically 1-10 samples per class.

Fine-Tuning

Continuing to train a pre-trained model on your specific dataset so it adapts its general visual knowledge to your particular classes, image style, and domain — reducing the labeled data and training time needed.

Foundation Models

Foundation models are large models pre-trained on massive image datasets. They serve as a starting point for various computer vision tasks like object detection, image classification, and segmentation, providing a base of learned features and patterns that can be fine-tuned for specific vision-related applications.

GAN (Generative Adversarial Network)

A generative model with two competing networks — a generator that creates synthetic images and a discriminator that tries to distinguish fakes from real photos — each improving through their rivalry.

Generative AI

Generative AI refers to artificial intelligence systems capable of generating data, content, or objects autonomously. These systems, often based on deep learning models like GANs, can produce images, text, audio, or other forms of data, allowing them to create new and original content based on patterns learned from training data.

Gesture Recognition

Gesture recognition is a technology that interprets human gestures or body movements to control and interact with computers or other devices. It allows users to convey commands, input data, or interact with a system through natural movements, making it valuable in applications like gaming, virtual reality, and user interfaces.

Gradient Descent

The optimization algorithm that trains neural networks by computing how much each weight contributes to prediction error, then adjusting all weights in the direction that reduces that error.
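
The update rule can be sketched on a one-dimensional toy problem, minimizing f(x) = x², whose gradient is 2x:

```python
def gradient_descent(start, lr=0.1, steps=100):
    # Repeatedly step against the gradient of f(x) = x^2
    x = start
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(gradient_descent(start=10.0))  # converges toward the minimum at x = 0
```

In a neural network the same idea applies to millions of weights at once, with the gradient supplied by backpropagation.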

Ground Truth

The verified, human-annotated labels — bounding boxes, masks, or class tags — that a model is trained and evaluated against. Ground truth quality sets the ceiling for how accurate a model can become.

Hallucination (in Vision-Language Models)

Hallucination in vision-language models occurs when the model generates text describing objects, attributes, or relationships that are not actually present in the input image, producing confident but factually incorrect outputs.

Hyperparameter Tuning

The process of finding the best training configuration — learning rate, batch size, epochs, augmentation strength — since these settings control how well the model learns and must be chosen before training begins.

Image Captioning

Image captioning is the task of automatically generating a natural language description of an image, combining visual understanding with text generation to produce sentences like "a dog playing fetch in a park."

Image Embeddings

Fixed-length numerical vectors that represent an image's visual content in compact form. Similar-looking images produce vectors that are close together, enabling visual search, clustering, and duplicate detection.
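
Similarity between embedding vectors is typically measured with cosine similarity, sketched here in plain Python:

```python
import math

def cosine_similarity(a, b):
    # ~1.0 means same direction (similar content); 0.0 means orthogonal (unrelated)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ~1.0, same direction
```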

Instance Segmentation

Instance segmentation is a computer vision task that combines object detection and semantic segmentation. It identifies and delineates individual objects within an image, assigning each pixel to a specific object instance. This provides a detailed understanding of the spatial extent and location of distinct objects in an image.

Intersection over Union (IoU)

A metric that measures overlap between a predicted bounding box or mask and the ground truth by dividing their intersection area by their union area. A score of 1.0 means perfect alignment; 0.0 means no overlap.
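
The metric can be computed directly from corner coordinates (a minimal sketch assuming (x_min, y_min, x_max, y_max) boxes):

```python
def iou(box_a, box_b):
    # Intersection rectangle corners (clamped to zero if boxes don't overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
# intersection 25, union 175 -> ~0.143
```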

Keypoint Detection

Keypoint detection is a computer vision task that identifies and localizes specific points or landmarks in an image. These keypoints represent important features, such as corners or interest points, and are often used for tasks like object tracking, pose estimation, and image alignment.

Knowledge Distillation

A compression technique where a smaller "student" model is trained to replicate the behavior of a larger "teacher" model, transferring learned knowledge into a faster, more deployable form.

Learning Rate

The training hyperparameter that controls how much a model's weights change per gradient update. Too large and training becomes unstable; too small and the model converges slowly or gets stuck.

LoRA (Low-Rank Adaptation)

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that adapts large pre-trained models by injecting small trainable matrices into existing layers, reducing memory and compute requirements by 10-100x compared to full fine-tuning.

Loss Function

The mathematical function that measures how far a model's predictions are from the correct answers, producing the error signal that gradient descent uses to update weights during training.

Machine Learning Operations (MLOps)

MLOps, short for Machine Learning Operations, is a set of practices and tools that combine machine learning with DevOps to manage the end-to-end machine learning lifecycle. It encompasses model development, deployment, monitoring, and automation, enabling efficient, scalable, and reliable machine learning operations in production environments.

Mask R-CNN

A two-stage instance segmentation model that detects objects with bounding boxes, classifies them, and predicts a pixel-level mask for each one — identifying both what objects are present and which pixels belong to each.

Mean Average Precision (mAP)

The primary evaluation metric for object detection, summarizing both classification accuracy and localization quality into a single score by averaging precision across all classes and IoU thresholds.

Model Deployment

Model deployment is the act of making a machine learning model operational and accessible for real-world use. It involves integrating the model into a software application, cloud service, or other systems, so it can make predictions or decisions based on new data in a practical, automated, and scalable manner.

Model Inference

Running a trained model on new data to generate predictions — the production phase where the model outputs bounding boxes, class labels, segmentation masks, or other results from inputs it has never seen before.

Model Pruning

A compression technique that removes low-importance weights or entire structural units from a trained network to make it smaller and faster, with minimal impact on accuracy.

Model Serving

Deploying trained models behind stable APIs or inference endpoints so applications can send data and receive predictions, with production concerns like request batching, autoscaling, and version management handled automatically.

Multimodal AI

Multimodal AI refers to artificial intelligence systems that can process and reason across multiple types of data (such as images, text, audio, and video) rather than being limited to a single input format.

Multimodal Alignment

Multimodal alignment is the process of training a model so that related concepts from different data types (like an image of a dog and the word dog) map to nearby points in a shared representation space.

Multimodal Embedding

A multimodal embedding is a vector representation in a shared space where images, text, and other data types are mapped so that semantically similar items from different modalities are positioned close together.

Multimodal Fusion

Multimodal fusion is the process of combining information from different data types (such as images and text) into a unified representation that a model can reason over jointly.

Multimodal Learning

Training models to understand and reason across multiple data types at once — such as images and text together — so they learn how different modalities relate to each other.

Neural Network

A computational model made of layers of connected nodes that learn to transform input data into useful outputs by adjusting connection weights through exposure to training examples.

Non-Maximum Suppression (NMS)

A post-processing step in object detection that eliminates duplicate bounding box predictions for the same object by keeping the highest-confidence detection and discarding heavily overlapping alternatives.
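
A minimal sketch of greedy NMS (boxes as (x_min, y_min, x_max, y_max) tuples; the 0.5 threshold is a typical but illustrative choice):

```python
def iou(a, b):
    # Overlap ratio of two (x_min, y_min, x_max, y_max) boxes
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Greedy NMS: keep the highest-scoring box, drop heavy overlaps, repeat
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
# the second box overlaps the first too heavily and is suppressed
```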

ONNX (Open Neural Network Exchange)

An open-source model format that lets you train in one framework (like PyTorch) and deploy on different hardware through a common runtime, without rewriting inference code.

Object Detection

Object detection is a computer vision task that involves identifying and locating objects within images or videos. It goes beyond image classification by not only classifying objects but also drawing bounding boxes around them, providing information about their positions in the image. It's widely used in applications like autonomous driving and image analysis.

Object Tracking

Object tracking is a computer vision process that involves monitoring and following the movement of objects within a sequence of images or a video stream over time. It assigns a unique identity to each object and tracks its position and motion as it moves through the frames, enabling applications like video surveillance and autonomous vehicles.

Open Vocabulary Detection

Open vocabulary detection is an object detection approach that identifies objects using free-text descriptions rather than a fixed list of class labels, removing the need to retrain for new object categories.

OpenCV

OpenCV (Open Source Computer Vision Library) is an open-source library of programming functions for real-time computer vision, image processing, and machine learning. Originally developed by Intel in 2000, it is now the most widely used computer vision library in the world.

Optical Character Recognition (OCR)

The technology that converts images of printed, handwritten, or scene text into machine-readable characters, used for document digitization, license plate reading, and receipt processing.

Optical Flow

The pattern of apparent motion between consecutive video frames, represented as a vector field where each pixel gets a direction and magnitude showing how it moved from one frame to the next.

Oriented Bounding Box (OBB)

A rotated rectangle that tightly encloses an object at an arbitrary angle, used when standard axis-aligned boxes would waste significant area on tilted targets like ships in satellite imagery or angled text.

Panoptic Segmentation

A unified segmentation task that labels every pixel in an image with both a class (sky, road, car) and an instance identity, combining background labeling with individual object separation.

Patch Embedding

Patch embedding is the process of splitting an image into fixed-size patches and converting each patch into a vector representation, transforming a 2D image into a sequence of tokens that a transformer can process.

Pose Estimation

Pose estimation is a computer vision task that identifies and calculates the positions and orientations of key body parts or objects within an image or video, often in the context of human pose analysis. It's used in applications such as motion capture, gesture recognition, and augmented reality.

Precision and Recall

Two metrics that measure detection quality from opposite angles — precision is the fraction of correct predictions out of all predictions made, recall is the fraction of actual objects the model found.

Prompt Engineering for Vision

Prompt engineering for vision is the practice of crafting effective text inputs to guide vision-language models toward accurate and useful outputs, including techniques specific to visual tasks like spatial descriptions and chain-of-thought reasoning.

Quantization

A model optimization technique that reduces the numerical precision of a neural network's weights and activations - for example, converting 32-bit floating-point values to 8-bit integers (INT8). Quantization significantly reduces model size, memory usage, and inference latency, making it essential for deploying models on edge devices and mobile hardware.
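
Symmetric linear quantization to INT8 can be sketched as follows (a simplified per-tensor scheme; real toolchains also handle zero points, calibration, and per-channel scales):

```python
def quantize_int8(values):
    # Map floats to the symmetric int8 range [-127, 127] via a single scale
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats from the integer codes
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.01]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# restored values match the originals to within one quantization step
```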

ROC Curve

A plot that shows a binary classifier's performance across all confidence thresholds, with the area under the curve (AUC) summarizing overall separation ability into a single number, where 0.5 corresponds to random guessing and 1.0 to perfect separation.

Real-Time Object Detection

Models that locate and classify objects fast enough for live applications, typically at 30+ frames per second. The YOLO family has defined this space since 2015 by treating detection as a single-pass problem.

Referring Expression Comprehension

Referring expression comprehension is the task of localizing a specific object in an image from a natural language description that distinguishes it from other objects, such as the taller glass on the right.

Regularization

A collection of training techniques — including weight decay, dropout, and data augmentation — that prevent a model from memorizing training data and instead force it to learn patterns that generalize to new inputs.

Retrieval-Augmented Generation (RAG)

An architecture that grounds language model responses in retrieved external documents or images rather than relying on memorized training data, reducing made-up answers and enabling responses about new information.

Segment Anything Model (SAM)

Meta AI's foundation model for image segmentation that produces high-quality masks from simple prompts — a point click, bounding box, or text description — without any task-specific training.

Self-Supervised Learning

A training approach where the model generates its own supervision from raw data — for example, masking part of an image and learning to reconstruct it — building useful visual features without human annotation.

Semantic Segmentation

Semantic segmentation is a computer vision task that classifies each pixel in an image to a specific object category or class. It provides a detailed understanding of the objects' spatial layout and enables the delineation of object boundaries in an image, making it useful in applications like image analysis and autonomous driving.

Semi-Supervised Learning

A training approach that combines a small set of labeled examples with a large pool of unlabeled data, using techniques like pseudo-labeling to extract learning signal from both.

SigLIP

SigLIP (Sigmoid Loss for Language-Image Pre-training) is Google's successor to CLIP that replaces the softmax contrastive loss with a sigmoid-based loss, enabling better scaling and serving as the vision encoder in PaliGemma and other modern VLMs.
