
SAM2Long: Higher Precision in Long-Term Video Segmentation


The release of the Segment Anything Model 2 (SAM2) by Nikhila Ravi and colleagues at Meta in 2024 marked a pivotal moment in video segmentation. With its memory-based approach, SAM2 delivered groundbreaking performance in both image and video segmentation tasks. However, the complexities of long-term video analysis exposed areas for improvement, particularly in handling occlusions and reappearing objects.

In late October 2024, researchers from The Chinese University of Hong Kong and the Shanghai Artificial Intelligence Laboratory unveiled SAM2Long. This enhancement builds on SAM2’s strengths while addressing its limitations through a sophisticated, training-free memory structure. SAM2Long dramatically improves performance in long-term and complex video scenarios, setting a new benchmark for precision and robustness.

The Challenge: SAM2’s Sub-Optimal Memory Design

SAM2 revolutionized video segmentation with its memory module, which stores past frames to guide current predictions. However, this design relies on a greedy mask selection approach, which leads to critical shortcomings:

  1. Error Accumulation: A single incorrect mask can propagate errors across subsequent frames, degrading performance.
  2. Limited Robustness: SAM2 struggles with frequent occlusions or reappearing objects, often losing track or misidentifying targets.

SAM2 Handling Occlusion Over Time (Ding et al., 2024, Figure 1)

While SAM2 excelled in controlled scenarios with clear visual cues, these limitations become evident in real-world applications such as surveillance, autonomous driving, and sports analytics, where continuous object tracking is essential.

Other Approaches to Video Object Segmentation (VOS)

Video Object Segmentation (VOS) involves tracking and segmenting objects across video frames, a task complicated by occlusions, reappearances, and dynamic transformations. Memory-based techniques address these challenges by leveraging past frame information to guide segmentation.

Among existing methods, XMem stands out as a leader. Its hierarchical memory structure captures fine-grained details while maintaining long-term memory, making it effective in handling occlusions and object reappearances. Cutie, a derivative of XMem, improves segmentation by incorporating object-level features for adaptability in complex settings.

Earlier methods like Space-Time Memory Networks (STM) and template-matching approaches laid the foundation for memory-based VOS but lacked the robustness needed for highly dynamic scenarios. Other frameworks, such as DEVA and SwinB-AOT, enhance spatial and temporal relationships but fall short in handling long-term or occlusion-heavy videos.

SAM2Long surpasses these approaches by integrating a constrained memory tree design, combining robust occlusion handling with a training-free, adaptable architecture.

The Solution: SAM2Long’s Constrained Tree Memory

SAM2Long overcomes SAM2’s limitations with an innovative memory structure that enhances adaptability and error resilience. Key features include:

Robust Segmentation with Multiple Pathways

Instead of committing to a single mask per frame, SAM2Long maintains multiple segmentation hypotheses. This approach reduces error propagation, allowing the model to recover from missteps. During ambiguous frames, SAM2Long dynamically evaluates these pathways to explore various possibilities without prematurely locking into incorrect segmentations.
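To make this concrete, below is a minimal, hypothetical sketch of how multiple pathways might be expanded and pruned at each frame. It is not the official SAM2Long implementation: the Pathway class and step function are illustrative names, and the per-frame candidates are assumed to be the (mask, predicted IoU) pairs that SAM2's mask decoder already outputs.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class Pathway:
    """One segmentation hypothesis tracked across frames (illustrative only)."""
    cumulative_score: float = 0.0
    masks: List[Any] = field(default_factory=list)  # one mask per processed frame

def step(pathways: List[Pathway],
         candidates: List[Tuple[Any, float]],
         num_pathways: int = 3) -> List[Pathway]:
    """Expand every pathway with every candidate mask, then prune back to the best few."""
    expanded = []
    for path in pathways:
        for mask, predicted_iou in candidates:
            expanded.append(Pathway(
                cumulative_score=path.cumulative_score + predicted_iou,
                masks=path.masks + [mask],
            ))
    # Constrain the tree: keep only the top-scoring branches for the next frame,
    # instead of greedily committing to a single mask as vanilla SAM2 does.
    expanded.sort(key=lambda p: p.cumulative_score, reverse=True)
    return expanded[:num_pathways]
```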

Cumulative Scoring for Long-Term Accuracy

SAM2Long scores segmentation pathways based on their performance over time, prioritizing those that consistently achieve high accuracy. This ensures reliable long-term segmentation by focusing on sustained performance rather than immediate results.

Occlusion Awareness

The model adapts to occlusions by emphasizing pathways that indicate object presence. This prevents temporary disruptions, such as objects passing behind others, from derailing the segmentation process.
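The plain IoU sum in the earlier sketch can be refined to capture both ideas: a frame only contributes to a pathway's cumulative score when the model believes the object is actually present. The rule below is a deliberate simplification of the paper's formulation, and the function and argument names are assumptions rather than SAM2Long's actual code.

```python
def update_pathway_score(cumulative_score: float,
                         predicted_iou: float,
                         object_score: float) -> float:
    """Add one frame's evidence to a pathway's running score.

    predicted_iou: mask-quality estimate from the decoder for this frame.
    object_score:  signed object-presence score; negative suggests occlusion.
    A simplified stand-in for SAM2Long's scoring rule, not the exact formula.
    """
    if object_score < 0:
        # Object likely occluded: treat the frame as neutral evidence so a
        # brief disappearance neither rewards nor punishes the pathway.
        return cumulative_score
    return cumulative_score + predicted_iou
```

Pathways that keep the object visible with high predicted IoU accumulate higher scores over time and survive pruning, while hypotheses built on a single lucky frame fall behind.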

SAM2Long Handling Occlusion Over Time (Ding et al., 2024, Figure 1)

Enhanced Object-Aware Memory Bank

SAM2Long’s memory bank selectively retains high-confidence frames while filtering out noisy data. This reduces the risk of integrating errors into memory and ensures the model focuses on the most informative elements of the video.
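A memory-bank filter along these lines might look like the sketch below, where a frame is admitted to memory only if its predicted IoU and object-presence score clear configurable thresholds. The function name, frame-record fields, and default memory size are hypothetical, not SAM2's actual API; the 0.3 IoU threshold echoes the ablation values discussed later in this article.

```python
from typing import Dict, List

def select_memory_frames(frames: List[Dict],
                         max_memory: int = 7,
                         iou_threshold: float = 0.3,
                         object_threshold: float = 0.0) -> List[Dict]:
    """Keep only confident, object-containing frames as memory entries.

    frames: per-frame records with 'features', 'predicted_iou', and 'object_score'.
    """
    confident = [
        f for f in frames
        if f["predicted_iou"] >= iou_threshold and f["object_score"] > object_threshold
    ]
    # Prefer the most recent confident frames so memory stays both clean and current.
    return confident[-max_memory:]
```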

Performance Benchmarks

Evaluation Metrics: J&F

SAM2Long’s performance is evaluated using J&F, a unified metric combining:

  • Jaccard Index (J): Measures overlap between predicted and ground truth masks, emphasizing area accuracy.
  • F-score (F): Focuses on contour alignment, evaluating the shape of segmented objects (a minimal computation sketch follows below).
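
For readers who want to reproduce the metric, the sketch below computes J and F for a single frame from binary masks using NumPy and OpenCV. The boundary extraction here is a simple morphological approximation of the official DAVIS evaluation code, not a drop-in replacement for it.

```python
import cv2
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 1.0

def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 3) -> float:
    """Contour accuracy F: F-score between mask boundaries within a pixel tolerance."""
    kernel = np.ones((3, 3), np.uint8)

    def boundary(mask: np.ndarray) -> np.ndarray:
        # Morphological gradient leaves only the mask's boundary pixels.
        return cv2.morphologyEx(mask.astype(np.uint8), cv2.MORPH_GRADIENT, kernel) > 0

    pred_b, gt_b = boundary(pred), boundary(gt)
    zone = np.ones((2 * tol + 1, 2 * tol + 1), np.uint8)  # tolerance neighbourhood
    gt_zone = cv2.dilate(gt_b.astype(np.uint8), zone) > 0
    pred_zone = cv2.dilate(pred_b.astype(np.uint8), zone) > 0
    precision = (pred_b & gt_zone).sum() / max(pred_b.sum(), 1)
    recall = (gt_b & pred_zone).sum() / max(gt_b.sum(), 1)
    denom = precision + recall
    return 2 * precision * recall / denom if denom > 0 else 0.0

def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """J&F is simply the mean of the two scores, averaged over frames and objects."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))
```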

Performance Overview

J&F Performance Comparison with State-of-the-Art Methods on SA-V Dataset (Ding et al., 2024)

SAM2Long consistently outperforms SAM2 across six VOS datasets, including SA-V, LVOS v1 and v2, MOSE, VOST, and PUMaVOS.

  • SA-V: SAM2Long achieves improvements in J&F ranging from 2.1 to 5.3 over SAM2. The largest benefits are seen in medium-sized models, indicating that additional memory support enhances these architectures.
  • Other Datasets: SAM2Long shows a consistent average improvement of 1-2 points in J&F, demonstrating robust performance across diverse scenarios.
  • Comparison to State-of-the-Art: SAM2Long surpasses methods like XMem, DeAOT, and STCN by significant margins, establishing itself as the top-performing approach.

Performance and Latency Analysis

Segmentation Performance Comparison Across Frames (Ding et al., 2024)

From our testing on an NVIDIA GeForce RTX 4070 GPU, we measured an average of 85-95 ms per frame for both strategies with SAM 2 Base+, showing that SAM2Long's training-free additions, at the recommended setting of three memory pathways, have no significant impact on compute or latency.

That said, latency does grow with the number of memory pathways: with five pathways, per-frame latency rises to around 125 ms, roughly a 50% increase, and the overhead appears to scale linearly with the pathway count.
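Numbers like these are straightforward to reproduce. The sketch below times an arbitrary per-frame segmentation callable; segment_frame is a stand-in for whichever SAM2 or SAM2Long predictor call you actually use, since the exact predictor API is not shown here.

```python
import statistics
import time

def benchmark_per_frame(segment_frame, frames, warmup: int = 10) -> dict:
    """Report per-frame latency in milliseconds for a segmentation callable.

    segment_frame: callable taking one decoded frame and returning a mask
                   (a placeholder for your actual predictor call).
    frames:        iterable of decoded video frames, longer than `warmup`.
    """
    timings = []
    for i, frame in enumerate(frames):
        start = time.perf_counter()
        segment_frame(frame)
        # If the predictor runs asynchronously on the GPU, synchronize here
        # (e.g. torch.cuda.synchronize()) before reading the clock.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup:  # skip warm-up frames (model load, CUDA kernel caching)
            timings.append(elapsed_ms)
    return {
        "mean_ms": statistics.mean(timings),
        "min_ms": min(timings),
        "max_ms": max(timings),
    }
```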

SAM2Long's advancements translate into tangible improvements across all six VOS benchmarks (SA-V, LVOS v1 and v2, MOSE, VOST, and PUMaVOS), reported as J for region similarity and F for contour accuracy. On the SA-V and LVOS benchmarks, SAM2Long achieved up to a 5.3-point improvement in J&F, excelling in complex, occlusion-heavy video sequences.

These benchmarks are particularly challenging due to the inclusion of long-term object reappearances and heavy occlusions.

  • SA-V: At all model sizes, SAM2.1Long outperforms SAM2.1 by 0.8 to 3.5 points in J&F, while SAM2Long outperforms SAM2 by 2.1 to 5.3 points. The gap has narrowed with SAM2's updated weights, though SAM2Long still maintains a clear improvement. The largest gains appear at the medium model sizes (S and B+), which may indicate that medium-sized architectures benefit the most from additional memory support in order to maximize the capabilities of the native architecture.
  • SA-V: Compared with prior state-of-the-art methods, SAM2 and SAM2Long stand out as major advances for VOS, outperforming methods such as XMem, DeAOT, and STCN by at least 17 points in J&F.
  • Other Datasets: SAM2Long outperforms SAM2 across the board on the remaining datasets, though by smaller margins, averaging 1-2 points in J&F. Gains in J and F individually are fairly balanced, indicating an overall quality improvement rather than one specific to regions or contours.

Optimizing SAM2Long Performance Parameters

Ablations point to a small set of parameters that drive SAM2Long's gains:

  • Memory pathways: Increasing from 1 to 3 pathways lets the model explore multiple hypotheses, improving J&F from 76.3 to 80.8; beyond three, performance plateaus, so three pathways strike the balance between accuracy and computational cost.
  • IoU threshold (δiou): A value of 0.3 filters unreliable frames while retaining valuable data, yielding a J&F of 80.8 compared to 77.8 at a stricter threshold of 0.9.
  • Uncertainty threshold (δconf): A value of 2 provides the best balance between confidence and diversity, avoiding both premature errors and unnecessary branching.
  • Memory attention modulation: A slight modulation within the range [0.95, 1.05] improves reliability without introducing excessive computation, ensuring the highest performance across metrics.
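In practice these choices reduce to a handful of configuration values. The dictionary below summarizes the reported sweet spots; the key names are illustrative rather than the exact flags exposed by the SAM2Long repository.

```python
# Reported sweet spots from the SAM2Long ablations; key names are illustrative.
SAM2LONG_CONFIG = {
    "num_pathways": 3,                       # multiple hypotheses; gains plateau beyond 3
    "iou_threshold": 0.3,                    # delta_iou: filter unreliable frames from memory
    "uncertainty_threshold": 2,              # delta_conf: when to branch into alternative masks
    "memory_attention_range": (0.95, 1.05),  # slight modulation of memory attention weights
}
```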

Next Steps

SAM2Long sets a new standard for video segmentation with its innovative, training-free approach that enhances the foundational architecture of SAM2. This advancement delivers robust performance in complex video scenarios, addressing the challenges of long-term tracking with markedly improved accuracy. As industries increasingly rely on reliable video segmentation for applications such as surveillance and autonomous driving, the implications of SAM2Long's capabilities are significant.

If you have questions, feel free to join our Community Slack to post your questions or contact us to train your own SAM-2 Model on Datature Nexus.

For more detailed information about the model functionality, customization options, or answers to any common questions you might have, read more on our Developer Portal.
