Vision-language models (VLMs) have emerged as transformative tools for solving complex AI challenges that integrate visual data with natural language understanding. Applications span industries such as healthcare, finance, and robotics, tackling problems like generating detailed captions, analyzing tables, and interpreting medical imagery.
At the forefront of this innovation is the PaliGemma family, designed to maximize efficiency and performance in fine-tuning across diverse tasks. With the release of PaliGemma 2, researchers and practitioners gain access to a VLM that pushes the boundaries of scalability, resolution, and versatility.
This article provides a deep dive into PaliGemma 2, explaining its architecture, advancements, and performance benchmarks while offering insights into its applications for machine learning engineers.
A Brief Look Back at PaliGemma
The original PaliGemma combined the SigLIP-400m vision encoder with the Gemma-2B language model to create a versatile, compact vision-language model under 3B parameters. This design allowed it to perform effectively across a wide range of tasks, including standard benchmarks like COCO captioning and VQA, as well as more specialized tasks such as remote-sensing QA and segmentation.
While it achieved strong performance, PaliGemma had limitations.
- Resolution Constraints: It was pre-trained at 224px², which limited its ability to handle tasks requiring higher image resolutions, like fine-grained detection or OCR.
- Scalability: Despite its efficiency, the model’s small scale meant it couldn’t handle more complex tasks that benefited from larger architectures, such as PaLI-X or PaLM-E.
- Task Specialization: While fine-tuning was effective, PaliGemma required significant adjustments for highly domain-specific tasks.
- Lack of Instruction Tuning: Without instruction tuning, PaliGemma was less adaptable for certain real-world applications that demand specific task instructions.
Despite these challenges, we demonstrated in a previous article that PaliGemma, a smaller VLM, could achieve competitive performance across a broad range of tasks, paving the way for PaliGemma 2.
What is PaliGemma 2?
PaliGemma 2 builds upon the architectural and conceptual foundations of its predecessor, introducing key enhancements to address scalability, task adaptability, and fine-tuning flexibility.
Scalable Model Variants
PaliGemma 2 is offered in three sizes:
- 3B Parameters: Compact and resource-efficient, ideal for constrained environments.
- 10B Parameters: A balanced option for mid-tier computational setups.
- 28B Parameters: Designed for high-performance tasks with no latency constraints.
This configurability ensures users can choose a model tailored to their compute resources and task requirements.
Multi-Resolution Support
The inclusion of resolutions at 224px², 448px², and 896px² enables PaliGemma 2 to process images with varying levels of detail. For instance:
- Low resolutions (224px²) are suitable for general image-captioning tasks.
- Higher resolutions (896px²) excel in intricate domains like document OCR or radiography.
This flexibility marks a significant improvement over PaliGemma, which was limited to a single resolution.
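To make these size and resolution options concrete, below is a minimal inference sketch using the Hugging Face `transformers` integration. Size and resolution are baked into each published checkpoint name (for example `google/paligemma2-3b-pt-448`); the image path and prompt here are placeholder assumptions, not part of the model release.

```python
# pip install transformers accelerate pillow
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Checkpoints follow the pattern paligemma2-{3b|10b|28b}-pt-{224|448|896};
# pick the size/resolution combination that matches your hardware and task.
model_id = "google/paligemma2-3b-pt-448"

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
prompt = "caption en"  # PaliGemma-style task prefix

# Cast floating-point inputs to the model dtype and move to its device.
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = inputs.to(torch.bfloat16).to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Drop the prompt tokens and decode only the newly generated text.
print(processor.decode(generation[0][input_len:], skip_special_tokens=True))
```

Swapping `model_id` for a 10B or 28B checkpoint, or a 224px²/896px² variant, requires no other code changes, which makes the size-resolution grid straightforward to experiment with.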
Modular Training Pipeline
PaliGemma 2 introduces a three-stage pretraining process:
- Stage 1: Uses 224px² images to establish foundational multimodal capabilities, with a mixture of captioning, detection, and OCR tasks.
- Stage 2: Trains on higher resolutions (448px² and 896px²) for specialized tasks requiring detailed analysis, such as table structure recognition.
- Stage 3: Fine-tunes on domain-specific datasets, ensuring optimal performance for tasks like radiography reporting and molecular structure recognition.
Key Features at a Glance
- Transferability: Fine-tunes effectively on over 40 benchmarks covering diverse domains.
- Efficiency: Reduced training costs despite supporting larger model sizes and higher resolutions.
- Open Weights: Available for customization and integration into workflows, making it accessible for practitioners and researchers.
These advancements make PaliGemma 2 not only more versatile but also more applicable to real-world challenges across a variety of industries.
What’s new in PaliGemma 2?
Expanding the Scope of Vision-Language Models
PaliGemma 2 extends the boundaries of what VLMs can achieve:
- Diverse Task Support: Tackles new challenges like optical music score recognition, spatial reasoning, and long-form image captioning.
- Fine-Grained Understanding: Incorporates high-resolution inputs to better understand complex visual relationships, such as tables and molecular graphs.
- Domain-Specific Applications: Excels in medical and scientific domains with tailored pretraining and fine-tuning techniques.
For instance, in molecular structure recognition, PaliGemma 2 achieved a 94.8% exact match score, surpassing previous models like MolScribe.
Improved Pretraining Recipe
PaliGemma 2’s three-stage pretraining pipeline is a key differentiator:
- Stage 1: Establishes a strong base with a task mixture of captioning, VQA, and referring expression comprehension.
- Stage 2: Introduces specialized tasks requiring detailed visual comprehension, such as OCR and table structure analysis. Tasks benefiting from higher resolution are upweighted.
- Stage 3: Optimizes domain-specific performance, ensuring the model excels in benchmarks like MIMIC-CXR for radiography reports.
By tailoring tasks to each stage, the model achieves a balance between general-purpose learning and task-specific optimization.
Enhanced Scalability and Resource Efficiency
Unlike many larger VLMs that demand extensive compute resources, PaliGemma 2 achieves efficiency through architectural refinements. For example:
- The 3B variant delivers competitive performance on common tasks, requiring only a fraction of the compute needed for the 28B model.
- The training process leverages TPUv5e pods for efficiency, with low-resolution training stages reducing FLOP costs significantly.
Benchmarks That Speak Volumes
PaliGemma 2 consistently outperforms its predecessor and contemporary models across key benchmarks:
- OCR Tasks: On Total-Text, PaliGemma 2’s F1 score of 74.2 at 896px² exceeds the prior state of the art.
- Table Recognition: Achieved near-perfect Grid Table Similarity (GriTS) scores on PubTabNet, with substantial improvements in Tree Edit Distance Similarity (TEDS).
- Long Captioning: Generated richer, more factual captions than models like InstructBLIP, reducing non-entailment sentence errors by over 20%.
PaliGemma vs. PaliGemma 2 Comparison
To better understand how the change in architecture has influenced performance, we ran a variety of tasks on the same set of images across several industries, comparing the PaliGemma Mix 448x448 checkpoint against a PaliGemma 2 checkpoint fine-tuned on DOCCI.
PaliGemma
Prompt: detect giraffe
Output: <loc0210><loc0125><loc0937><loc0641> giraffe ; <loc0157><loc0529><loc0933><loc0802> giraffe
PaliGemma 2
Prompt: detect giraffe
Output: <loc0205><loc0129><loc0933><loc0623> giraffe ; <loc0159><loc0512><loc0935><loc0790> giraffe
For this image, we tried two prompts, starting with raw detection of the giraffes. Both models detect two of the three giraffes, with the boxes centered on the two at the front. However, when asked “How many giraffes’ legs are in the photo?”, PaliGemma accurately answers 12, whereas PaliGemma 2 answers 3. This fine-tuned PaliGemma 2 checkpoint appears to lean on raw visual information when generating its outputs. A notable quirk of the original PaliGemma was that its visual outputs, such as bounding boxes, did not always match the semantics of its textual outputs; it might draw two bounding boxes while stating there are three of the same object. Our hypothesis is that, in constructing a dataset and methodology that enforce more unified and consistent logic between these outputs, this version of PaliGemma 2 leans too heavily on its raw visual outputs and does not sufficiently incorporate the textual context of the prompt: when counting legs, the dominant visual cue is the giraffes themselves, so it returns an accurate count of the giraffes instead.
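The `<locXXXX>` tokens in these outputs encode box corners on a normalized 0-1023 grid, ordered y_min, x_min, y_max, x_max per the PaliGemma detection format. Below is a minimal sketch for decoding such a string back into pixel-space boxes; the image dimensions are illustrative placeholders.

```python
import re

def parse_detections(output: str, image_width: int, image_height: int):
    """Decode a PaliGemma-style detection string into pixel-space boxes.

    Each detection is four <locXXXX> tokens (y_min, x_min, y_max, x_max on a
    0-1023 grid) followed by a label; detections are separated by ';'.
    """
    pattern = re.compile(
        r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;]+)"
    )
    detections = []
    for y0, x0, y1, x1, label in pattern.findall(output):
        detections.append({
            "label": label.strip(),
            "box_xyxy": (
                int(x0) / 1024 * image_width,   # x_min
                int(y0) / 1024 * image_height,  # y_min
                int(x1) / 1024 * image_width,   # x_max
                int(y1) / 1024 * image_height,  # y_max
            ),
        })
    return detections

raw = ("<loc0205><loc0129><loc0933><loc0623> giraffe ; "
       "<loc0159><loc0512><loc0935><loc0790> giraffe")
print(parse_detections(raw, image_width=640, image_height=480))
```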
In this image, we asked the models to describe the image in general. PaliGemma 2 outputs “Black and white image of the shoulder and upper arm,” while PaliGemma outputs “broken shoulder.” When asked what part of the body the fracture occurred in, both models correctly identified the shoulder. From this, it seems that PaliGemma 2, possibly due to the dataset it was fine-tuned on, leans towards more descriptive outputs, whereas PaliGemma focuses on directly responding to the question.
For this image, we first asked the models, via text, how many traffic lights there are. Both models answer that there are three. When asked to detect them with bounding boxes, PaliGemma 2 produces two meaningfully distinct bounding boxes, whereas PaliGemma produces only one. This suggests that PaliGemma 2 is better attuned to the dedicated bounding box and segmentation tokens, a clear advantage we also noted in other images.
Performance Benchmarks
Advancing Vision-Language Understanding
The results from PaliGemma 2’s benchmark tests reflect its ability to handle complex, high-stakes tasks:
- Medical Imaging: Achieved state-of-the-art RadGraph F1 scores on the MIMIC-CXR dataset, demonstrating potential for clinical diagnostics.
- Molecular Structure Recognition: Outperformed MolScribe with a 94.8% exact match score, showcasing its utility in chemistry and drug discovery.
- Multimodal Tasks: Delivered improved results on tasks like AOKVQA and NLVR2, leveraging its large language model for advanced reasoning.
The Importance of Model Size and Resolution
Analysis reveals how model size and resolution interact:
- Higher resolutions (448px², 896px²) are critical for tasks involving intricate details, such as document layout analysis and text detection.
- Larger models (10B, 28B) provide superior results for tasks requiring deeper language understanding, like spatial reasoning and long-form captioning.
A thoughtful balance between these parameters allows users to tailor the model to their unique requirements, whether constrained by computational limits or task complexity.
Applications and Real-World Impact
PaliGemma 2’s flexibility and high performance enable practical solutions across industries, transforming workflows and enabling new capabilities.
Healthcare
PaliGemma 2 automates radiology report generation by analyzing medical scans, such as chest X-rays, to produce detailed diagnostic narratives. Fine-tuned on the MIMIC-CXR dataset, it achieves a RadGraph F1 score of 29.5, surpassing prior models and significantly reducing report turnaround time while improving diagnostic consistency.
Scientific Research
In molecular chemistry, PaliGemma 2 accelerates drug discovery by converting molecular diagrams into SMILES strings with a 94.8% exact match rate. This eliminates manual transcription errors, enabling researchers to focus on simulations and analysis, making it a valuable tool for chemical and pharmaceutical industries.
Finance and Business
With fine-tuning on datasets like PubTabNet, PaliGemma 2 excels in extracting data from complex tables, generating structured outputs like HTML representations. This automates financial report processing, reducing manual errors and speeding up data integration for key business insights.
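As a downstream illustration of this workflow, an HTML `<table>` string produced by such a checkpoint can be loaded directly into a DataFrame for further processing. The table below is a hand-written stand-in for real model output, not an actual PaliGemma 2 generation.

```python
# pip install pandas lxml
import io
import pandas as pd

# Stand-in for the HTML a table-recognition checkpoint might emit.
html_table = """
<table>
  <tr><th>Quarter</th><th>Revenue</th><th>Margin</th></tr>
  <tr><td>Q1</td><td>4.2M</td><td>31%</td></tr>
  <tr><td>Q2</td><td>4.9M</td><td>33%</td></tr>
</table>
"""

# read_html parses every <table> in the document; take the first one.
df = pd.read_html(io.StringIO(html_table))[0]
print(df)
```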
Accessibility and Content Creation
PaliGemma 2’s ability to generate detailed captions, including spatial relationships and object counts, enhances multimedia accessibility for visually impaired users. On the DOCCI dataset, it reduced factual errors by 20%, enabling platforms to produce more accurate and meaningful descriptions at scale.
Creative Industries
In music transcription, PaliGemma 2 accurately converts sheet music into kern format, achieving a 59% reduction in error rates compared to prior models. This makes it an invaluable tool for digitizing historical scores or creating editable formats for composers and archivists.
What’s Next for Vision-Language Models?
Tackling Compute Constraints
Future VLMs must address the growing demand for compute efficiency. Techniques like mixed-precision training and adaptive fine-tuning strategies can reduce costs without sacrificing performance.
Beyond 2D Vision-Language Tasks
The next frontier lies in integrating 3D visual data, expanding capabilities to domains like augmented reality and robotics. PaliGemma 2 sets a foundation for this evolution with its scalable architecture.
Bridging the Research-Industry Gap
While models like PaliGemma 2 are powerful, their true potential lies in practical deployments. Developing user-friendly fine-tuning pipelines and pre-trained domain-specific weights will make VLMs more accessible to industry professionals.
Build Your Own Custom Model
PaliGemma 2 underscores the potential of vision-language models to redefine AI’s capabilities. By improving scalability, task coverage, and fine-tuning flexibility, it paves the way for even greater innovation. Whether you’re an ML engineer exploring its applications or a researcher building the next breakthrough, PaliGemma 2 is a model worth diving into.
If you have questions, feel free to join our Community Slack to post your questions or contact us to train your own Computer Vision Model on Datature Nexus.
For more detailed information about the model functionality, customization options, or answers to any common questions you might have, read more on our Developer Portal.