The Segment Anything Model 2 (SAM-2) revolutionized video object segmentation (VOS) by introducing a memory-based architecture that processes sequential frames while maintaining object coherence. Its ability to generate pixel-level masks has transformed applications across surveillance, autonomous vehicles, and sports analytics. However, as the complexity of real-world scenarios increases, SAM-2’s limitations become apparent—most notably in handling occlusions, reappearing objects, and error propagation.
To overcome these challenges, derivative models like SAMURAI and SAM2Long have emerged. These models leverage advanced memory and motion-aware mechanisms to enhance segmentation accuracy and long-term tracking. This article delves into their innovations, evaluates their performance against benchmarks, and explores the future of VOS.
SAM-2's Core Limitations
We previously reviewed the full extent of SAM 2’s capabilities when it first launched, in this blog post. Even at that early stage, we identified some of its shortcomings and general limitations. While SAM-2 was a significant improvement over the original SAM, several aspects of its foundational design constrain its performance, particularly in video tracking, as we outline below.
Error Accumulation
SAM-2 relies on a greedy mask selection mechanism, choosing the mask with the highest predicted Intersection over Union (IoU) score for each frame. This approach is brittle in ambiguous scenarios. For instance, in crowded scenes, where multiple objects resemble the target, an incorrect selection introduces errors that cascade across subsequent frames.
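To make this concrete, the greedy step amounts to an argmax over SAM-2’s self-predicted IoU scores. The sketch below is purely illustrative; the function and argument names are hypothetical, not SAM-2’s actual API:

```python
import numpy as np

def greedy_mask_selection(masks, predicted_ious):
    """Pick the mask with the highest self-predicted IoU score.

    masks: list of HxW boolean arrays (candidate masks for one frame)
    predicted_ious: SAM-2's self-predicted IoU score for each candidate
    """
    best = int(np.argmax(predicted_ious))
    # An incorrect pick here is written into memory and can cascade
    # across all subsequent frames -- the failure mode described above.
    return masks[best]
```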
Handling Occlusions
SAM-2 struggles with occlusion-heavy scenarios, often defaulting to appearance similarity rather than considering motion and temporal cues. This limitation is critical for applications such as traffic monitoring, where objects frequently overlap.
Real-World Robustness
In fast-moving, high-fidelity video environments, SAM-2's fixed memory mechanism often encodes irrelevant or low-confidence features, further degrading its reliability. While SAM-2 excelled in controlled scenarios with clear visual cues, these limitations become evident in real-world applications such as surveillance, autonomous driving, and sports analytics, where continuous object tracking is essential.
Other Approaches to Video Object Segmentation (VOS)
Video Object Segmentation (VOS) involves the challenging task of tracking and segmenting objects across video frames. Real-world conditions—such as occlusions, object reappearances, and dynamic transformations—make this especially complex. Memory-based techniques have become central to addressing these challenges by leveraging prior frame information to guide segmentation decisions. SAM-2 represents a foundational step in this evolution, introducing memory-based segmentation for maintaining object coherence across frames; however, its approach of selecting masks based solely on IoU scores and using a static memory structure limits its adaptability in dynamic and occlusion-heavy scenarios.
Existing VOS Methods
To fully appreciate the advances brought by SAMURAI and SAM2Long, it is helpful to examine how existing memory-based VOS methods have addressed similar challenges and where they fall short. By understanding the strengths and weaknesses of these approaches, we can better contextualize the innovations introduced by the SAM-2 derivatives. Here are a few of them:
- XMem: A prominent model in the field, XMem employs a hierarchical memory structure that captures fine-grained details and long-term dependencies. This enables it to handle occlusions and object reappearances effectively.
- Cutie: A derivative of XMem, Cutie improves segmentation by incorporating object-level features, offering adaptability in complex, dynamic environments.
- Space-Time Memory Networks (STM): Among the earliest memory-based approaches, STM paved the way for temporal modeling in VOS but lacked robustness for highly dynamic scenarios.
- DEVA and SwinB-AOT: These models enhance spatial and temporal relationships but struggle with long-term tracking and occlusion-heavy videos, limiting their applicability in more complex cases.
The Need for New Approaches
While SAM-2 introduces a foundational memory-based method for VOS, its reliance on a single segmentation pathway and fixed memory structure highlights the need for more flexible solutions. Models like XMem offer hierarchical memory solutions, but they do not address error recovery in long-term sequences. These limitations underline the importance of approaches that can handle diverse challenges, from rapid object movement to sustained occlusions.
Advances in SAMURAI and SAM2Long
SAMURAI: Motion-Aware Memory for Robust Tracking
SAMURAI builds upon SAM-2 with two key innovations: motion modeling and motion-aware memory selection. Both methods are training-free and can be applied independently. While not necessarily novel in the realm of tracking, their effectiveness comes from combining well-understood techniques such as Kalman filters with SAM 2’s self-evaluation of its own predictions, grounding mask selection in real-world motion cues to better track objects.
Motion Modeling
Motion modeling is based on a linear Kalman filter, which is used to select the best mask from the multiple candidates generated by SAM 2. The filter’s state tracks the bounding box surrounding the mask (its position and dimensions) along with their velocities. The selected mask is the one that maximizes a weighted sum of two terms: the IoU between the candidate’s bounding box and the box predicted by the Kalman filter, and the original affinity score generated by SAM 2.
Additionally, this module is only engaged once the object has been consistently tracked over the past several frames. This prevents motion-based tracking from overwhelming short bursts of detection from SAM 2, balancing SAM 2’s innate ability to detect similar objects against tracking the realistic motion of objects.
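A minimal sketch of this weighted selection is shown below, assuming a constant-velocity Kalman prediction of the bounding box. The weight `alpha` and all helper names are illustrative stand-ins, not taken from the SAMURAI codebase:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def predict_box(state):
    """Constant-velocity prediction; state = (x, y, w, h, vx, vy, vw, vh)
    with (x, y) the box center."""
    x, y, w, h, vx, vy, vw, vh = state
    x, y, w, h = x + vx, y + vy, w + vw, h + vh
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

def select_mask(candidate_boxes, affinities, kf_state, alpha=0.5):
    """Score each candidate mask by a weighted sum of the IoU with the
    Kalman-predicted box and SAM 2's own affinity score."""
    kf_box = predict_box(kf_state)
    scores = [alpha * box_iou(b, kf_box) + (1 - alpha) * a
              for b, a in zip(candidate_boxes, affinities)]
    return int(np.argmax(scores))
```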
Selective Memory Updates
SAMURAI optimizes the memory bank by retaining only the most relevant frames, reducing noise introduced by low-confidence masks. Frames are admitted to memory only if their mask affinity, objectness, and Kalman filter scores all meet sufficient thresholds. As in SAM 2, there is still a maximum number of look-back frames, but rather than simply taking the chronologically most recent frames, the bank holds more solid examples of the object.
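A hedged sketch of this gating logic might look like the following; the threshold names and data layout are invented for illustration:

```python
def select_memory_frames(history, max_frames=7,
                         mask_thresh=0.5, obj_thresh=0.0, kf_thresh=0.3):
    """Keep only high-quality frames in the memory bank.

    history: list of dicts with 'mask_score', 'objectness', and
             'kf_score' entries for each past frame.
    Unlike plain SAM-2, which keeps the N chronologically most recent
    frames, we keep the N most recent frames that pass all thresholds.
    """
    reliable = [f for f in history
                if f["mask_score"] > mask_thresh
                and f["objectness"] > obj_thresh
                and f["kf_score"] > kf_thresh]
    return reliable[-max_frames:]
```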
SAM2Long: Constrained Memory Tree for Long-Term Segmentation
Robust Segmentation with Multiple Pathways
Instead of committing to a single mask per frame, SAM2Long maintains multiple segmentation hypotheses. This approach reduces error propagation, allowing the model to recover from missteps. During ambiguous frames, SAM2Long dynamically evaluates these pathways to explore various possibilities without prematurely locking into incorrect segmentations.
Cumulative Scoring for Long-Term Accuracy
SAM2Long scores segmentation pathways based on their performance over time, prioritizing those that consistently achieve high accuracy. This ensures reliable long-term segmentation by focusing on sustained performance rather than immediate results.
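Conceptually, this resembles a beam search over segmentation hypotheses: each pathway is extended by its candidate masks, and only the pathways with the best cumulative scores survive. The sketch below is a simplified illustration under that reading, not SAM2Long’s actual implementation:

```python
def step_pathways(pathways, candidates_per_pathway, beam_width=3):
    """Expand each pathway with its candidate masks for the current
    frame, then keep the beam_width best pathways overall.

    pathways: list of (mask_sequence, cumulative_score) tuples
    candidates_per_pathway: for each pathway, a list of
        (mask, frame_score) candidates for the current frame
    """
    expanded = []
    for (seq, cum_score), candidates in zip(pathways, candidates_per_pathway):
        for mask, frame_score in candidates:
            # Cumulative scoring favors sustained accuracy over a
            # single high-scoring frame.
            expanded.append((seq + [mask], cum_score + frame_score))
    expanded.sort(key=lambda p: p[1], reverse=True)
    return expanded[:beam_width]
```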
Occlusion Awareness
The model adapts to occlusions by emphasizing pathways that indicate object presence. This prevents temporary disruptions, such as objects passing behind others, from derailing the segmentation process.
Enhanced Object-Aware Memory Bank
SAM2Long’s memory bank selectively retains high-confidence frames while filtering out noisy data. This reduces the risk of integrating errors into memory and ensures the model focuses on the most informative elements of the video.
Comparing SAMURAI and SAM2Long
While both models significantly enhance SAM-2’s capabilities, they diverge in their approach and performance focus:
- Key Strengths: SAMURAI excels at fast-moving objects and crowded scenes thanks to its motion-aware memory scoring, and it is better at maintaining a strong sense of object appearance. SAM2Long performs better over long video sequences, using its memory trees to carefully work backwards along viable tracks.
- Performance Trade-offs: While both SAMURAI and SAM2Long maintain relatively high performance with the default hyperparameters, SAMURAI offers real-time capabilities with minimal computational overhead, making it ideal for high-speed applications. SAM2Long introduces slightly higher latency due to its memory tree but achieves higher accuracy in occlusion-heavy scenarios.
- Limitations: SAMURAI was fundamentally designed for single object tracking, even though multi-object tracking is the more common and applicable use case. SAM2Long’s hyperparameters are not easy to fine-tune, as larger settings rapidly become computationally infeasible.
Evaluation Metrics
J&F
SAM2Long’s performance is evaluated using J&F, a unified metric combining:
- Jaccard Index (J): Measures overlap between predicted and ground truth masks, emphasizing area accuracy.
- F-score (F): Focuses on contour alignment, evaluating the shape of segmented objects.
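Using the standard DAVIS-style definitions, with predicted mask $M$, ground truth mask $G$, and contour precision and recall $P_c$ and $R_c$:

$$
J = \frac{|M \cap G|}{|M \cup G|}, \qquad
F = \frac{2\,P_c R_c}{P_c + R_c}, \qquad
J\&F = \frac{J + F}{2}
$$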
Area Under the ROC Curve (AUC)
Area Under the ROC Curve, as the name describes, is the area under the curve formed by plotting the model’s true positive rate against its false positive rate across several thresholds. In computer vision models, this is computed via an overlap threshold that determines whether a prediction sufficiently overlaps its ground truth label to count as a true positive. As the threshold varies, the true positive and false positive rates change, tracing out a curve whose area can then be calculated.
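As a concrete illustration, single object tracking benchmarks commonly use the closely related success-plot AUC, computed from per-frame IoUs. The sketch below shows that variant under our assumptions; it is not the exact benchmark code:

```python
import numpy as np

def tracking_auc(per_frame_ious, num_thresholds=51):
    """Area under the success curve: for each overlap threshold, the
    success rate is the fraction of frames whose IoU exceeds it."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success = [(np.asarray(per_frame_ious) > t).mean() for t in thresholds]
    return np.trapz(success, thresholds)
```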
Pnorm / P
Pnorm and P are the normalized and standard precision, computed from the same evaluation setup described above. Precision is the true positive count divided by the sum of the true positive and false positive counts.
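Expressed as a formula, the precision described above is:

$$
P = \frac{TP}{TP + FP}
$$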
SAM2Long's Performance
SAM2Long displays strong performance in both single object and multi-object tracking. Because its memory trees can encode longer potential sequences of movement, SAM2Long handles occlusion quite well, even in challenging, complex environments. However, since it does not lean heavily on physical motion cues to constrain its predicted outputs, it can still suffer from incongruous tracking results when SAM-2 fails to consistently provide meaningful predictions, as can happen with complex movements or rapid changes in an object’s appearance.
SAM2Long's advancements translate into tangible performance improvements, as evidenced by its evaluation across six video object segmentation (VOS) benchmarks: SA-V (Segment Anything), LVOS v1 and v2 (Long VOS), MOSE, VOST, and PUMaVOS.
- SA-V: SAM2Long achieves improvements in J&F ranging from 2.1 to 5.3 over SAM-2. The largest benefits are seen in medium-sized models, indicating that additional memory support enhances these architectures.
- Other Datasets: SAM2Long shows a consistent average improvement of 1-2 points in J&F, demonstrating robust performance across diverse scenarios.
- Comparison to State-of-the-Art: SAM2Long surpasses methods like XMem, DeAOT, and STCN by significant margins, establishing itself as the top-performing approach.
For SA-V in particular, SAM2.1Long outperforms SAM2.1 at all model sizes, with J&F gains ranging from 0.8 to 3.5, while SAM2Long outperforms SAM-2 by 2.1 to 5.3. This indicates the gap has narrowed with SAM-2’s updated weights, though SAM2Long maintains an improvement. Additionally, the largest benefits appear at medium model sizes such as S and B+, which may indicate that medium-sized architectures require additional memory support to maximize the capabilities of the native architecture.
Performance and Latency Analysis
From our testing on an NVIDIA GeForce RTX 4070 GPU, we measured an average of 85-95 ms per frame for both strategies with SAM 2 Base+, demonstrating that SAM2Long’s additional training-free method, at its optimal hyperparameter of 3 pathways, does not significantly impact compute or latency.
It should be noted that with more memory pathways, the compute overhead does grow enough to impact latency, rising to 125 ms with 5 memory pathways (roughly a 50% increase), and it appears to grow linearly with the number of pathways.
Optimizing SAM2Long Performance Parameters
SAM2Long’s key hyperparameters were ablated as follows:
- Memory pathways: Increasing from 1 to 3 pathways significantly enhances performance by enabling the model to explore multiple hypotheses, with J&F improving from 76.3 to 80.8. Beyond three pathways, performance plateaus, so three strikes a balance between accuracy and computational cost.
- IoU threshold (δiou): A value of 0.3 achieves optimal performance by effectively filtering unreliable frames while retaining valuable data, yielding a J&F score of 80.8 compared to 77.8 at a stricter threshold of 0.9.
- Uncertainty threshold (δconf): A setting of 2 provides the best balance between confidence and diversity, avoiding premature errors or unnecessary complexity.
- Memory attention modulation: A slight modulation within the range [0.95, 1.05] enhances reliability without introducing excessive computation, ensuring the highest performance across metrics.
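For reference, these settings could be expressed as a configuration like the one below. The key names are hypothetical stand-ins, not SAM2Long’s actual arguments:

```python
# Hypothetical configuration reflecting the ablation results above.
sam2long_config = {
    "num_pathways": 3,          # gains plateau beyond 3; more adds latency
    "delta_iou": 0.3,           # IoU threshold for filtering unreliable frames
    "delta_conf": 2,            # uncertainty threshold for pathway branching
    "memory_attention_range": (0.95, 1.05),  # memory attention modulation
}
```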
SAMURAI's Performance
SAMURAI primarily targets the single object tracking task. As such, its published large-scale benchmarks were performed on fundamentally different datasets. In this more restrictive space, SAMURAI demonstrates state-of-the-art performance and even shows significant improvement over SAM-2.1. SAMURAI’s benefits also appear to grow with model size, suggesting that the better the underlying model can identify the correct object, the more capable SAMURAI is at denoising the remaining mask predictions.
Performance and Latency Analysis
On the same NVIDIA GeForce RTX 4070 GPU setup, SAMURAI likewise averaged 85-95 ms per frame with SAM 2 Base+, indicating that its training-free modules add no significant compute or latency overhead.
Both modules in SAMURAI are demonstrably valuable, contributing a 2-4% improvement over baseline SAM 2, although the motion module’s gain is slightly smaller than the memory module’s. This suggests both modules help, but traditional Kalman-filter-based techniques only carry so much value before the intentional curation of the memory bank comes into play.
Optimizing SAMURAI Performance Parameters
With SAMURAI, adjusting the balance between the Kalman filter’s predicted motion and the raw affinity scores from SAM 2 directly affects its ability to track objects. For rapidly evolving objects, leaning towards the Kalman filter tends to be more effective, whereas consistently moving objects surrounded by similar objects and heavier occlusion may benefit more from letting SAM 2 determine affinity. Additionally, raising the memory bank thresholds makes it more exclusive, which reduces noisy predictions but also reduces the number of frames contributing to tracking. Overall, one should experiment with these settings to determine what is most appropriate for the use case.
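As a starting point, those knobs could be exposed like this; the names are illustrative, not SAMURAI’s actual flags:

```python
# Hypothetical tuning knobs for SAMURAI-style tracking.
samurai_params = {
    # Higher alpha trusts Kalman-predicted motion over SAM 2 affinity:
    # useful for fast, erratically moving targets.
    "motion_weight_alpha": 0.6,
    # Stricter memory thresholds cut noisy memory entries but also
    # shrink the set of frames available as tracking context.
    "memory_mask_thresh": 0.6,
    "memory_objectness_thresh": 0.1,
    "memory_kf_thresh": 0.4,
}
```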
Shortcomings and Future Directions
While SAMURAI and SAM2Long significantly enhance VOS, areas for improvement remain:
Broadened Scope
- SAMURAI’s code implementation is currently limited to single object tracking, with little ability to refine mask predictions through additional annotations, because it alters SAM-2’s source code. For SAMURAI to achieve full efficacy, it would need to regain the full utility that SAM-2 provides.
Computational Overhead
- SAM2Long’s constrained memory tree introduces additional latency, particularly with increased pathway counts. Optimizing this tradeoff between accuracy and speed is a key research direction.
Extreme Occlusions
- Both derivatives struggle with scenarios involving extreme occlusions or very small objects, where even memory-aware mechanisms cannot maintain reliable tracking.
Future Directions
- Real-Time Adaptability: Incorporating real-time learning mechanisms could enable models to adjust dynamically to changing environments.
- K-Shot Learning: Expanding the derivatives’ capacity to learn from minimal labeled data can open new possibilities in low-resource scenarios.
- Integration with 3D Data: Leveraging depth information or multi-modal inputs could further enhance segmentation in occlusion-heavy environments.
Build Your Own Custom Model
SAMURAI and SAM2Long exemplify the evolution of video object segmentation. By addressing SAM-2’s core limitations, these derivatives offer enhanced accuracy, robustness, and adaptability, setting new standards for VOS applications.
If you have questions, feel free to join our Community Slack to post your questions or contact us to train your own Computer Vision Model on Datature Nexus.
For more detailed information about the model functionality, customization options, or answers to any common questions you might have, read more on our Developer Portal.