Resources

Glossary

Definitions of AI terms, platform features, and machine learning concepts.

Action Recognition

Action recognition is the process of identifying and categorizing human actions or movements in videos or images, such as walking, running, or dancing, to enable computer systems to understand and respond to these actions automatically.

Activation Function

A mathematical function applied after each neural network layer that introduces non-linearity, allowing the model to learn complex patterns like edges, textures, and shapes rather than only simple linear relationships.
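
As a minimal plain-Python sketch (illustrative only, not how frameworks implement these internally), two of the most common activation functions look like this:

```python
import math

def relu(x):
    # ReLU: zero out negatives, pass positives through unchanged
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid: squash any real input into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-x))
```

In practice these are applied element-wise to every output of a layer, which is where the non-linearity comes from.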

Active learning

Active learning is a training approach where the model doesn’t just passively consume labeled data - it actively chooses which unlabeled examples would be most valuable to label next. By prioritizing samples the model is uncertain about or that best cover the data space, you can reach a target accuracy with far fewer labeled examples (and lower labeling cost) than random labeling.

Anchor Box

Anchor boxes are predefined bounding boxes of fixed sizes and aspect ratios that object detection models use as reference templates when predicting object locations.

Anchor-Free Detection

An object detection approach that predicts bounding boxes directly from feature map locations, removing the need for predefined anchor box templates and the manual tuning they require.

Annotation

Annotation is the process of labeling your data to teach your deep learning model the outcome you want to predict. Generally, bounding boxes are used to train for object detection and polygons are used to train for instance segmentation.

Annotation Format

An annotation format is the specific convention used to encode annotations and describe a bounding box's size and position (COCO, YOLO, TXT, etc.).

Anomaly Detection

The task of identifying patterns in images that deviate from what is expected, such as defective products on a manufacturing line, unusual structures in medical scans, or damaged infrastructure in inspection photos.

Application Programming Interface (API)

An application programming interface is a mechanism that lets software components communicate with other applications or databases. Companies use APIs to support digital transformation or build out an ecosystem. We provide a REST API so users can easily import their models into our platform.

Attention Mechanism

A technique that lets a neural network focus on the most relevant parts of its input by computing weighted combinations where important regions receive higher influence on the final prediction.

Attribute/attribute group

An attribute is a single item of data used in machine learning; attribute groups define clusters of related attributes that describe additional information about a product.

Augmentation

Augmentation improves dataset robustness by letting users expand an existing dataset through positional or color-space transformations. These techniques prevent the model from leaning on specific features during training.

Automated Machine Learning (AutoML)

AutoML automates the tasks involved in building and optimizing training models for real-world applications, covering the whole process from loading a raw dataset to deploying the ML model.

Backpropagation

The algorithm neural networks use to learn — it calculates how much each weight contributed to prediction errors and adjusts them accordingly, repeating this process over every batch of training data.

Batch Normalization

A training technique that stabilizes learning by normalizing each layer's inputs to have zero mean and unit variance, allowing higher learning rates and reducing sensitivity to weight initialization.
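
The core normalization step can be sketched in plain Python (a simplified version; real batch norm also learns per-channel scale and shift parameters, omitted here):

```python
def batch_norm(values, eps=1e-5):
    # Normalize a batch of activations to zero mean and unit variance;
    # eps guards against division by zero when the variance is tiny
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / (var + eps) ** 0.5 for v in values]
```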

Bounding box

A bounding box is a rectangular region of an image that encloses an object and is described by its (x, y) coordinates.

COCO

COCO (Common Objects in Context) is a large-scale image dataset with annotations stored in JSON format, widely used as a benchmark for comparing model performance on common object detection problems.

Chain-of-Thought Reasoning

A prompting technique where a model works through a problem step by step before producing a final answer, improving accuracy on tasks that require multi-step logic or spatial reasoning.

Class Imbalance

A dataset condition where some classes have far more examples than others, causing models to favor the majority class and miss the rare cases that actually matter unless corrected with resampling or modified loss functions.

Classification

Classification is a machine learning task where data is categorized into predefined classes or labels. The goal is to build a model that can predict the correct label for new, unseen data based on patterns and features learned from a training dataset. It's widely used in various applications, such as spam detection or image recognition.

Clustering

Clustering is an unsupervised technique that groups data points by similarity, without requiring labels.

Computer Vision

Computer Vision is the science of enabling computers to see and understand images and video. This is accomplished by developing algorithms that can make sense of visual content, for example detecting people or objects in an image or video, or being able to read road signs.

Confusion Matrix

A confusion matrix is a table used in machine learning to evaluate the performance of a classification model. It summarizes the model's predictions by showing the true positive, true negative, false positive, and false negative counts, enabling the assessment of accuracy, precision, recall, and other metrics.
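
For the binary case, the four counts can be tallied with a short plain-Python helper (an illustrative sketch, not a library function):

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Tally TP / TN / FP / FN for a binary classifier
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}
```

Metrics such as precision (tp / (tp + fp)) and recall (tp / (tp + fn)) follow directly from these counts.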

Contrastive Learning

A self-supervised training method where a model learns to produce similar representations for related images and different representations for unrelated ones, without needing class labels.

Convolutional neural networks (CNN)

A CNN is a neural network with at least one convolutional layer, typically used for image recognition and classification.

DETR (Detection Transformer)

An object detection architecture from Meta AI that uses learned object queries and a transformer decoder to directly predict detections, removing the need for anchor boxes and non-maximum suppression.

Data Labeling

The process of adding structured annotations to raw images — such as bounding boxes, segmentation masks, or class tags — so that models can learn from them. Label quality directly sets the ceiling for model performance.

Data Preprocessing

The steps applied to raw images and annotations before training — resizing, normalizing pixel values, converting color spaces, and transforming labels to match the model's expected input format.

Dataset Splitting

Dividing a labeled dataset into training, validation, and test subsets so the model can learn from one portion, tune hyperparameters on another, and be evaluated fairly on data it has never seen.
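
A common 70/15/15 split can be sketched in plain Python (the ratios and helper name are illustrative, and a fixed seed keeps the split reproducible):

```python
import random

def split_dataset(items, train=0.7, val=0.15, seed=42):
    # Shuffle deterministically, then slice into train / val / test
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Shuffling before slicing matters: without it, an ordered dataset (for example, sorted by class) would put entire classes into a single split.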

Deep Learning

A branch of machine learning that uses neural networks with many layers to automatically learn hierarchical features from data — raw pixels become edges, edges become textures, textures become parts, and parts become recognizable objects.

Depth Estimation

A computer vision task that predicts how far each pixel in an image is from the camera, producing a depth map used in autonomous driving, robotics, augmented reality, and 3D scene reconstruction.

Diffusion Models

A class of generative models that create images by learning to reverse a gradual noising process — starting from random static and iteratively refining it into a coherent image, as used in Stable Diffusion and DALL-E.

Domain Adaptation

Techniques for closing the gap between training data and real-world deployment data when they come from different conditions, such as a model trained on daytime images struggling when used at night.

Dropout

A regularization technique that randomly deactivates a fraction of neurons during each training step, forcing the network to spread learned features more evenly and reducing overfitting on small datasets.
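
A sketch of "inverted" dropout in plain Python (the common variant where survivors are scaled at training time so no rescaling is needed at inference; illustrative, not framework code):

```python
import random

def dropout(activations, p=0.5, rng=None):
    # Randomly zero a fraction p of activations; scale survivors by
    # 1 / (1 - p) so the expected sum of the layer is unchanged
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```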

Edge AI

Running trained models directly on local hardware — cameras, phones, industrial controllers — instead of sending data to cloud servers, enabling low-latency decisions, data privacy, and offline operation.

Encoder-Decoder Architecture

A neural network design where an encoder compresses input into a compact representation and a decoder expands it back to the desired output, commonly used in image segmentation with skip connections to preserve spatial detail.

Epoch

One complete pass through the entire training dataset. Most models train for multiple epochs, seeing each image many times, because a single pass rarely extracts all learnable patterns from the data.

Explainable AI

Methods and tools that show which parts of an image a model focused on when making a prediction, using techniques like Grad-CAM heatmaps or SHAP scores to make otherwise opaque decisions interpretable.

F1 Score

The harmonic mean of precision and recall, providing a single number that balances both metrics. Unlike a simple average, it penalizes extreme imbalances — high precision with low recall still yields a low F1.
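
The formula is small enough to show directly (a plain-Python sketch):

```python
def f1_score(precision, recall):
    # Harmonic mean: punishes imbalance between precision and recall
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, precision 1.0 with recall 0.1 averages to 0.55 arithmetically, but its F1 is only about 0.18, which is exactly the penalty the harmonic mean is meant to apply.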

Feature Pyramid Network (FPN)

A multi-scale feature extraction architecture that combines deep semantic features with shallow spatial detail through a top-down pathway, helping detection models handle objects at all sizes.

Fine-Tuning

Continuing to train a pre-trained model on your specific dataset so it adapts its general visual knowledge to your particular classes, image style, and domain — reducing the labeled data and training time needed.

Foundation Models

Foundation models are large neural networks pre-trained on massive image datasets. These models serve as a starting point for various computer vision tasks like object detection, image classification, and segmentation. They provide a foundation of learned features and patterns that can be fine-tuned for specific vision-related applications.

GAN (Generative Adversarial Network)

A generative model with two competing networks — a generator that creates synthetic images and a discriminator that tries to distinguish fakes from real photos — each improving through their rivalry.

Generative AI

Generative AI refers to artificial intelligence systems capable of generating data, content, or objects autonomously. These systems, often based on deep learning models like GANs, can produce images, text, audio, or other forms of data, allowing them to create new and original content based on patterns learned from training data.

Gesture Recognition

Gesture recognition is a technology that interprets human gestures or body movements to control and interact with computers or other devices. It allows users to convey commands, input data, or interact with a system through natural movements, making it valuable in applications like gaming, virtual reality, and user interfaces.

Gradient Descent

The optimization algorithm that trains neural networks by computing how much each weight contributes to prediction error, then adjusting all weights in the direction that reduces that error.
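
The update rule can be demonstrated on a one-dimensional toy loss (a sketch with an assumed loss f(x) = (x - 3)², whose gradient is 2(x - 3) and whose minimum sits at x = 3):

```python
def gradient_descent(grad, x0=0.0, lr=0.1, steps=100):
    # Repeatedly step opposite the gradient to reduce the loss
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Loss f(x) = (x - 3)**2 has gradient 2 * (x - 3); minimum at x = 3
minimum = gradient_descent(lambda x: 2 * (x - 3))
```

Real training does the same thing, except x is millions of weights and the gradient comes from backpropagation over a batch of data.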

Ground Truth

The verified, human-annotated labels — bounding boxes, masks, or class tags — that a model is trained and evaluated against. Ground truth quality sets the ceiling for how accurate a model can become.

Hyperparameter Tuning

The process of finding the best training configuration — learning rate, batch size, epochs, augmentation strength — since these settings control how well the model learns and must be chosen before training begins.

Image Embeddings

Fixed-length numerical vectors that represent an image's visual content in compact form. Similar-looking images produce vectors that are close together, enabling visual search, clustering, and duplicate detection.
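
"Close together" is usually measured with cosine similarity between embedding vectors, sketched here in plain Python (real systems compute this over learned high-dimensional embeddings, not hand-written lists):

```python
import math

def cosine_similarity(a, b):
    # 1.0 means identical direction; values near 0 mean unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```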

Instance Segmentation

Instance segmentation is a computer vision task that combines object detection and semantic segmentation. It identifies and delineates individual objects within an image, assigning each pixel to a specific object instance. This provides a detailed understanding of the spatial extent and location of distinct objects in an image.

Intersection over Union (IoU)

A metric that measures overlap between a predicted bounding box or mask and the ground truth by dividing their intersection area by their union area. A score of 1.0 means perfect alignment; 0.0 means no overlap.
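
For axis-aligned boxes the computation is a few lines of plain Python (boxes assumed to be (x1, y1, x2, y2) corner coordinates; an illustrative sketch):

```python
def iou(box_a, box_b):
    # Intersection rectangle corners (empty if the boxes don't overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```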

Keypoint Detection

Keypoint detection is a computer vision task that identifies and localizes specific points or landmarks in an image. These keypoints represent important features, such as corners or interest points, and are often used for tasks like object tracking, pose estimation, and image alignment.

Knowledge Distillation

A compression technique where a smaller "student" model is trained to replicate the behavior of a larger "teacher" model, transferring learned knowledge into a faster, more deployable form.

Learning Rate

The training hyperparameter that controls how much a model's weights change per gradient update. Too large and training becomes unstable; too small and the model converges slowly or gets stuck.

Loss Function

The mathematical function that measures how far a model's predictions are from the correct answers, producing the error signal that gradient descent uses to update weights during training.

Machine Learning Operations (MLOps)

MLOps, short for Machine Learning Operations, is a set of practices and tools that combine machine learning with DevOps to manage the end-to-end machine learning lifecycle. It encompasses model development, deployment, monitoring, and automation, enabling efficient, scalable, and reliable machine learning operations in production environments.

Mask R-CNN

A two-stage instance segmentation model that detects objects with bounding boxes, classifies them, and predicts a pixel-level mask for each one — identifying both what objects are present and which pixels belong to each.

Mean Average Precision (mAP)

The primary evaluation metric for object detection, summarizing both classification accuracy and localization quality into a single score by averaging precision across all classes and IoU thresholds.

Model Deployment

Model deployment is the act of making a machine learning model operational and accessible for real-world use. It involves integrating the model into a software application, cloud service, or other systems, so it can make predictions or decisions based on new data in a practical, automated, and scalable manner.

Model Inference

Running a trained model on new data to generate predictions — the production phase where the model outputs bounding boxes, class labels, segmentation masks, or other results from inputs it has never seen before.

Model Pruning

A compression technique that removes low-importance weights or entire structural units from a trained network to make it smaller and faster, with minimal impact on accuracy.

Model Serving

Deploying trained models behind stable APIs or inference endpoints so applications can send data and receive predictions, with production concerns like request batching, autoscaling, and version management handled automatically.

Multimodal Learning

Training models to understand and reason across multiple data types at once — such as images and text together — so they learn how different modalities relate to each other.

Neural Network

A computational model made of layers of connected nodes that learn to transform input data into useful outputs by adjusting connection weights through exposure to training examples.

Non-Maximum Suppression (NMS)

A post-processing step in object detection that eliminates duplicate bounding box predictions for the same object by keeping the highest-confidence detection and discarding heavily overlapping alternatives.
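
A minimal greedy NMS in plain Python (boxes as (x1, y1, x2, y2) corners; this sketch returns the indices of the detections kept, which mirrors the classic algorithm but not any particular library's API):

```python
def nms(boxes, scores, iou_threshold=0.5):
    # Greedy NMS: keep the top-scoring box, drop overlapping rivals, repeat
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```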

ONNX (Open Neural Network Exchange)

An open-source model format that lets you train in one framework (like PyTorch) and deploy on different hardware through a common runtime, without rewriting inference code.

Object Detection

Object detection is a computer vision task that involves identifying and locating objects within images or videos. It goes beyond image classification by not only classifying objects but also drawing bounding boxes around them, providing information about their positions in the image. It's widely used in applications like autonomous driving and image analysis.

Object Tracking

Object tracking is a computer vision process that involves monitoring and following the movement of objects within a sequence of images or a video stream over time. It assigns a unique identity to each object and tracks its position and motion as it moves through the frames, enabling applications like video surveillance and autonomous vehicles.

OpenCV

OpenCV (Open Source Computer Vision Library) is an open-source library of programming functions for real-time computer vision, image processing, and machine learning. Originally developed by Intel in 2000, it is now the most widely used computer vision library in the world.

Optical Character Recognition (OCR)

The technology that converts images of printed, handwritten, or scene text into machine-readable characters, used for document digitization, license plate reading, and receipt processing.

Optical Flow

The pattern of apparent motion between consecutive video frames, represented as a vector field where each pixel gets a direction and magnitude showing how it moved from one frame to the next.

Oriented Bounding Box (OBB)

A rotated rectangle that tightly encloses an object at an arbitrary angle, used when standard axis-aligned boxes would waste significant area on tilted targets like ships in satellite imagery or angled text.

Panoptic Segmentation

A unified segmentation task that labels every pixel in an image with both a class (sky, road, car) and an instance identity, combining background labeling with individual object separation.

Pose Estimation

Pose estimation is a computer vision task that identifies and calculates the positions and orientations of key body parts or objects within an image or video, often in the context of human pose analysis. It's used in applications such as motion capture, gesture recognition, and augmented reality.

Precision and Recall

Two metrics that measure detection quality from opposite angles — precision is the fraction of correct predictions out of all predictions made, recall is the fraction of actual objects the model found.
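
Both follow directly from true-positive, false-positive, and false-negative counts (a plain-Python sketch; the guard clauses handle the degenerate zero-count cases):

```python
def precision_recall(tp, fp, fn):
    # Precision: of everything predicted, how much was right?
    # Recall: of everything real, how much was found?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```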

Quantization

A model optimization technique that reduces the numerical precision of a neural network's weights and activations - for example, converting 32-bit floating-point values to 8-bit integers (INT8). Quantization significantly reduces model size, memory usage, and inference latency, making it essential for deploying models on edge devices and mobile hardware.
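
A simplified symmetric-quantization sketch in plain Python (one scale for the whole tensor; production toolchains typically use per-channel scales and calibration data):

```python
def quantize_int8(values):
    # Symmetric quantization: map the float range onto [-127, 127]
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats from the int8 codes
    return [v * scale for v in q]
```

The round trip is lossy, which is why quantized models trade a small amount of accuracy for large savings in size and latency.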

ROC Curve

A plot that shows a binary classifier's performance across all confidence thresholds, with the area under the curve (AUC) summarizing overall separation ability into a single number between 0.5 and 1.0.

Real-Time Object Detection

Models that locate and classify objects fast enough for live applications, typically at 30+ frames per second. The YOLO family has defined this space since 2015 by treating detection as a single-pass problem.

Regularization

A collection of training techniques — including weight decay, dropout, and data augmentation — that prevent a model from memorizing training data and instead force it to learn patterns that generalize to new inputs.

Retrieval-Augmented Generation (RAG)

An architecture that grounds language model responses in retrieved external documents or images rather than relying on memorized training data, reducing made-up answers and enabling responses about new information.

Segment Anything Model (SAM)

Meta AI's foundation model for image segmentation that produces high-quality masks from simple prompts — a point click, bounding box, or text description — without any task-specific training.

Self-Supervised Learning

A training approach where the model generates its own supervision from raw data — for example, masking part of an image and learning to reconstruct it — building useful visual features without human annotation.

Semantic Segmentation

Semantic segmentation is a computer vision task that classifies each pixel in an image to a specific object category or class. It provides a detailed understanding of the objects' spatial layout and enables the delineation of object boundaries in an image, making it useful in applications like image analysis and autonomous driving.

Semi-Supervised Learning

A training approach that combines a small set of labeled examples with a large pool of unlabeled data, using techniques like pseudo-labeling to extract learning signal from both.

Small Object Detection

The challenge of finding objects that occupy very few pixels in an image. Multi-scale feature maps, image tiling, and higher input resolutions help preserve the fine detail these small targets need.

Super-Resolution

Generating a higher-resolution image from a lower-resolution input, recovering fine details and sharp edges that were lost during downsampling or never captured by the sensor.

Synthetic Data

Artificially generated training images — created through 3D rendering, GANs, or diffusion models — with automatic annotations, used to supplement real data or cover rare scenarios that are hard to photograph.

TFLite / LiteRT

Google's framework for running machine learning models on mobile phones, microcontrollers, and edge devices, using quantization and hardware delegates to minimize latency and memory usage.

TensorRT

NVIDIA's inference optimization toolkit that converts trained models into highly efficient runtime engines for NVIDIA GPUs through layer fusion, precision calibration, and kernel auto-tuning.

Three-Dimensional Object Detection

Predicting the location, size, and orientation of objects in 3D space using LiDAR point clouds, depth cameras, or camera-LiDAR fusion — essential for autonomous driving and robotics where spatial understanding matters.

Training Metrics

Training metrics in machine learning are quantitative measures used to evaluate and assess the performance of a model during its training process. They help in understanding how well the model is learning from the data and can include metrics like accuracy, loss, precision, recall, and F1-score, among others.

Transfer Learning

A deep learning technique where a model trained on one task or dataset is reused as the starting point for a different but related task. Transfer learning significantly reduces training time and data requirements by leveraging learned features from large-scale pretraining, making it especially valuable when labeled data is limited.

Transformer

A neural network architecture built on self-attention, where each element in the input can attend to every other element. Originally designed for text, transformers now dominate computer vision and power models like ViT, DETR, and SAM.

U-Net

A segmentation architecture with a symmetric encoder-decoder structure and skip connections that combine deep semantic features with fine spatial detail, originally designed for biomedical image analysis.

Unstructured Data

Unstructured data refers to information that doesn't have a predefined format or structure, making it harder to organize and analyze using traditional data processing methods. Examples include text, images, audio, and video. Specialized techniques, like natural language processing and computer vision, are used to extract insights from unstructured data.

Video Understanding

The broad field of extracting meaning from video by analyzing temporal dynamics, motion patterns, and event progression over time — going beyond single-frame analysis to understand what is happening and how it changes.

Vision Transformers (ViTs)

Vision transformers are a class of neural networks that apply the transformer architecture, originally designed for sequence modelling tasks like language translation, to image processing tasks.

Vision-Language Models (VLMs)

Vision language models are multi-modal models that can learn simultaneously from images and texts to tackle many tasks, from visual question answering to image captioning.

Visualization

In computer vision, visualization involves creating graphical representations of visual data, aiding in tasks like image segmentation, object detection, and image classification. It helps interpret and communicate information through visual cues, such as bounding boxes, heatmaps, or feature representations, enhancing the understanding of computer vision algorithms and results.

YOLO

YOLO (You Only Look Once) is a real-time object detection algorithm in computer vision. It processes an image once to simultaneously predict bounding boxes and class probabilities for multiple objects. YOLO is known for its speed and accuracy, making it suitable for various applications, including autonomous driving and surveillance.
