Introducing Florence-2: Microsoft’s Latest Multi-Modal, Compact Visual Language Model

Trevor Carrell
Editor

What is Florence-2?

Florence-2 is Microsoft's new visual language model (VLM) designed to handle diverse tasks such as object detection, segmentation, image captioning, and grounding, all within a single unified model. With a significantly reduced model size compared to state-of-the-art (SOTA) vision-language models, Florence-2 establishes itself as a new SOTA zero-shot model. Additionally, fine-tuning Florence-2 enables it to remain competitive with specialist models, producing SOTA results on several tasks. Florence-2 achieves this by using a multi-task learning objective and extensive visual annotations on an autonomously generated dataset, FLD-5B, containing 5.4 billion annotations on 126 million images.

Multi-Task Model

Traditional models that excel at singular tasks, such as object detection, semantic segmentation, and image captioning, often rely on a task-specific architectural design. This paradigm, however, limits these models to single-task usage, necessitating multiple models to tackle a multi-task problem. Florence-2 addresses this limitation by introducing a unified, task-agnostic model built on a transformer-based architecture. This design allows a single model to operate across various tasks without requiring changes to its architecture.

Architecture

Florence-2 is a transformer model composed of encoder and decoder blocks that maps an input sequence to an output sequence. An image and a set of multi-task prompts are fed into the model. The image is processed by the vision encoder, the Dual Attention Vision Transformer (DaViT) created by Ding et al. (2022), which converts it into vision token embeddings of shape (Nv, Dv), where Nv is the number of vision tokens and Dv is their dimensionality; for the primary vision encoder used, DaViT-Base, Dv is 1024. The multi-task prompts, which include task instructions like “What does the image describe?”, are converted into text prompt embeddings of shape (Nt, D) using Florence-2’s extended language tokenizer and word-embedding layer, where Nt is the number of text tokens and D is the model dimension. The vision embeddings are then linearly projected and normalized to shape (Nv, D) and concatenated with the text prompt embeddings.
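To make these shapes concrete, here is a minimal PyTorch sketch of the projection and concatenation step; the token counts, model dimension, and layer names are illustrative assumptions rather than Florence-2’s actual values or module names.

import torch
import torch.nn as nn

Nv, Dv = 577, 1024   # number of vision tokens (assumed) and DaViT-Base output dimension
Nt, D = 16, 768      # number of text tokens and encoder-decoder model dimension (assumed)

vision_tokens = torch.randn(1, Nv, Dv)     # output of the DaViT vision encoder
text_embeddings = torch.randn(1, Nt, D)    # output of the extended tokenizer + word-embedding layer

# Linearly project the vision tokens into the model dimension, then normalize them.
project = nn.Linear(Dv, D)
norm = nn.LayerNorm(D)
vision_embeddings = norm(project(vision_tokens))                         # (1, Nv, D)

# Concatenate vision and text embeddings into a single input sequence for the encoder.
encoder_input = torch.cat([vision_embeddings, text_embeddings], dim=1)   # (1, Nv + Nt, D)
print(encoder_input.shape)                                               # torch.Size([1, 593, 768])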

Then, Florence-2 uses a standard encoder-decoder transformer architecture to process this input sequence, generating a new sequence that encodes the task annotation as the output for each task-specific prompt. For the base model, there are 6 encoder layers and 6 decoder layers, and for the large model, 12 encoder layers and 12 decoder layers.

Task Formulation

Depending on the task-specific prompt, the response can be either text-based or region-based. For text-based tasks, both the prompt and response are plain text. For region-based tasks, 1000 location tokens are added to Florence-2’s language tokenizer vocabulary, each representing a pixel coordinate quantized into one of 1000 bins. Since we are primarily interested in vision-based tasks, and natural-language outputs are simply text tokens, we show examples of region-based outputs below, followed by an end-to-end inference sketch. Depending on task requirements, the sequence format is as follows:

  • Box Representation: (x0, y0, x1, y1), where (x0, y0) is the top-left coordinate of the box and (x1, y1) is the bottom-right coordinate, with each point encoded as a location token in the input sequence. Box representation is used for tasks such as object detection and dense region captioning. An example annotation would be an input caption of "A man with an orange helmet", and the output being:

{
    "<CAPTION_TO_PHRASE_GROUNDING>": {
        "bboxes": [
            [
                96.31999969482422,
                8.697000503540039,
                529.5999755859375,
                427.4909973144531,
            ],
            [
                242.87998962402344,
                8.697000503540039,
                425.91998291015625,
                139.8209991455078,
            ],
        ],
        "labels": ["a man", "an orange helmet"],
    }
}
  • Quad-Box Representation: (x0, y0,..., x3, y3), where (xi, yi) is a corner coordinate of the quadrilateral, represented as a location token in the input sequence. Quad-box representation is used for tasks such as text detection and recognition. An example annotation would be an input caption of “How many items are sold?”, and the output being:

{
    "<OCR_WITH_REGION>": {
        "quad_boxes": [
            [
                634.2000122070312,
                71.4000015258789,
                1199.4000244140625,
                233.40000915527344,
                1199.4000244140625,
                293.4000244140625,
                619.800048828125,
                124.20000457763672,
            ]
        ],
        "labels": ["# ITEMS SOLD 5"],
    }
}
  • Polygon Representation: (x0, y0,..., xn, yn), where (xi, yi) is a point coordinate of the polygon, represented as a location token in the input sequence. Polygon representation is used for tasks such as segmentation. It is important to note that the model’s pre-training primarily involved class-agnostic segmentation, but chained inputs can be used to obtain class-relevant segmentation masks. An example annotation could be an input caption of "orange helmet", and the output being:

{
    "<REFERRING_EXPRESSION_SEGMENTATION>": {
        "polygons": [
            [
                [
                    260.1600036621094,
                    65.78500366210938,
                    262.0799865722656,
                    63.55500030517578,
                    263.3599853515625,
                    60.43300247192383,
                    265.2799987792969,
                    57.75699996948242,
                    ...
                ]
            ]
        ],
        "labels": ["orange helmet"],
    }
}
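
Putting the pieces together, the sketch below shows how outputs like the ones above can be produced with the publicly released checkpoint on Hugging Face. The image URL is a placeholder, and the processor’s behaviour comes from the model’s remote code, so treat the call signatures as indicative rather than definitive.

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/worker.jpg", stream=True).raw)  # placeholder URL
task = "<CAPTION_TO_PHRASE_GROUNDING>"
prompt = task + "A man with an orange helmet"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# post_process_generation converts the location tokens back into pixel coordinates,
# yielding a dictionary in the same format as the examples above.
parsed = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(parsed)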

Model Loss Function

The optimization objective for Florence-2 is the standard language-modelling loss, with cross-entropy computed across the tokens. It focuses on the difference between the predicted tokens and the ground-truth tokens: for each token in the ground-truth sequence, the preceding ground-truth tokens are provided as context, and the model’s probability distribution over the next token measures how accurately it predicts that token. Specifically, the loss is defined as follows, where θ represents the neural network’s parameters and the summation iterates over each of the target tokens in the sequence:
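
With x denoting the combined image and prompt input and y the target token sequence of length |y|:

\mathcal{L}(\theta) = -\sum_{i=1}^{|y|} \log P_{\theta}\left(y_{i} \mid y_{<i}, x\right)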

This aligns with previous works such as PaliGemma, in which the training objective is not differentiated across computer vision tasks. As discussed in our previous article, this has clear benefits and limitations: a unified loss allows simpler integration of datasets across varying tasks, but it restricts how precisely the model can be tuned to any specific task.

Autonomously Generated Dataset

Since Florence-2 is a unified model, it requires a large pre-training dataset with a highly diverse set of multi-task annotations. However, such data is scarce. To address this, the authors of Florence-2 developed FLD-5B, a large multi-task dataset composed of 126 million images featuring 500 million text annotations, 1.3 billion region-text annotations, and 3.6 billion text-phrase-region annotations across multiple tasks. FLD-5B was built in two steps: data collection and data annotation.

Data Collection

To construct FLD-5B, the authors sourced imagery from prominent datasets that serve as primary sources for three key tasks: image classification, object detection, and image captioning. These datasets include ImageNet-22k, Objects365, Open Images, Conceptual Captions, and LAION. After filtering, these sources provided the 126 million images that compose FLD-5B.

Data Annotation

A key feature of FLD-5B is that its annotations are largely synthetic, as opposed to human-generated. To generate annotations, the authors employ a three-step process: initial annotation with specialist models, data filtering and enhancement, and iterative data refinement.

Initial Annotation with Specialist Models

Specialist models are task-specific models tailored to excel at a single task, so the annotations they produce are highly reliable. The dataset leverages these specialist models to initialize annotations across a number of tasks. Where human-generated annotations already exist, they are kept and used in tandem with the synthetic labels. Examples of these specialist models include Segment Anything (SAM) for object segmentation annotations and Grounding DINO for object detection labels.
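
As an illustration, the sketch below shows how a specialist model such as SAM could be used to seed class-agnostic segmentation annotations; the checkpoint filename, image path, and confidence threshold are placeholder assumptions rather than values from the paper.

import numpy as np
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load a SAM checkpoint and build an automatic mask generator (checkpoint filename is illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# Generate candidate masks for one image (placeholder path), then keep only confident ones
# as initial, class-agnostic region annotations.
image = np.array(Image.open("worker.jpg").convert("RGB"))
masks = mask_generator.generate(image)

initial_annotations = [
    {"segmentation": m["segmentation"], "bbox": m["bbox"], "score": m["predicted_iou"]}
    for m in masks
    if m["predicted_iou"] > 0.9  # assumed confidence threshold
]
print(f"Seeded {len(initial_annotations)} candidate region annotations")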

Data Filtering and Enhancement

To minimize noise and imprecision in the dataset’s annotations, the authors employ a filtration and enhancement process which prioritizes richness of visual concepts in images, and confidence and necessity in regional annotations. To do this, the authors develop filtering approaches that focus on text and regional data.

For text data, a filtering approach based on Complexity, Action, and Text-spotting (CAT) filtering from Radenovic et al. (2023) is used. This method takes a given sentence and builds a semantic graph of context-rich objects, attributes, and actions. For example, given the sentence “A black cat is chasing a small brown bird,” a rule-based parser would extract the objects “cat” and “bird,” the attributes “black,” “small,” and “brown,” and the action “chasing.” These extractions are then used to create a semantic graph for estimating the complexity of the image caption. Complexity is defined as the maximum number of relations (objects, attributes, or actions) attached to any object in the graph. For the object “bird,” there are the attributes “small” and “brown” and the action “chasing,” giving a complexity of three. Depending on specific requirements, a minimum complexity threshold can be set to ensure sufficient richness in the caption.
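
A toy version of this complexity check is sketched below, with a hard-coded relation graph standing in for the output of a rule-based parser; the threshold value is an assumption.

relations = {
    "cat":  ["black", "chasing"],            # attribute + action attached to "cat"
    "bird": ["small", "brown", "chasing"],   # attributes + action attached to "bird"
}

# Complexity is the maximum number of relations attached to any single object in the graph.
complexity = max(len(edges) for edges in relations.values())  # 3, driven by "bird"

MIN_COMPLEXITY = 3  # assumed threshold; the actual value is a design choice
keep_caption = complexity >= MIN_COMPLEXITY
print(complexity, keep_caption)  # 3 True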

For regional data, the authors filter using a confidence score of an annotated region assigned by the specialist models. Specifically, annotations below a specified threshold are removed. Additionally, to eliminate redundant, overlapping annotations, the authors use non-maximum suppression, which suppresses or eliminates annotations that have lower confidence scores and retains those with higher scores.
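
A minimal sketch of this two-stage filter is shown below: a confidence cut-off followed by greedy non-maximum suppression over axis-aligned boxes. The threshold values are assumptions for illustration, not the ones used to build FLD-5B.

def iou(a, b):
    """Intersection-over-union of two boxes in (x0, y0, x1, y1) format."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_regions(regions, score_threshold=0.5, iou_threshold=0.7):
    """Drop low-confidence regions, then suppress overlapping, lower-scoring duplicates."""
    regions = [r for r in regions if r["score"] >= score_threshold]
    regions.sort(key=lambda r: r["score"], reverse=True)
    kept = []
    for region in regions:
        if all(iou(region["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(region)
    return kept

regions = [
    {"box": (96.3, 8.7, 529.6, 427.5), "score": 0.92},
    {"box": (100.0, 10.0, 520.0, 420.0), "score": 0.55},  # near-duplicate of the first box
    {"box": (242.9, 8.7, 425.9, 139.8), "score": 0.88},
]
print(filter_regions(regions))  # the near-duplicate, lower-scoring box is suppressed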

Iterative Data Refinement

Following initial annotation and enhancement, the dataset underwent iterative refinement. This process included training Florence-2 and using its outputs to update annotations, which significantly improves annotation accuracy. Subsequently, the dataset underwent additional training rounds to further refine annotations, enhancing their precision. In cases where filtration had removed annotations from specialist models for certain tasks, the iteratively trained model was utilized to reintroduce these annotations. Sparse or task-specific datasets may be leveraged to fine-tune Florence-2, enabling it to effectively serve as a specialist annotator and contribute to the continuous refinement of FLD-5B.

Annotation Specific Variations

FLD-5B features three types of annotations: text, region-text pairs, and text-phrase-region triplets. By defining specific annotations, FLD-5B can support a number of tasks.

Text

Text annotations have three levels of specificity: brief, detailed, and more detailed. Brief text includes a single sentence, whereas detailed and more detailed text contain multiple sentences with increasing richness of objects, attributes, and actions. With text annotations, FLD-5B supports tasks describing entire images, such as image classification and image captioning.

Region-Text Pairs

Region-text pairs are text annotations associated with a semantic region and include three types of textual annotations: a phrase from the object's category, a brief text annotation, and noun phrase chunks from the brief text. Similar to text annotations, region-text pairs have varying levels of specificity, which adjust the richness of detail in an annotation. With region-text pairs, FLD-5B supports regional annotation tasks, such as image segmentation, region classification, region captioning, and optical character recognition (OCR).

Text-Phrase-Region Triplets

Text-phrase-region triplets consist of a text annotation of the image with the same three levels of specificity, noun phrases corresponding to image objects, and region annotations for the image objects. With text-phrase-region triplets, FLD-5B supports descriptive image captioning, connecting region-text pairs within a given image using image context.

Model Performance and Comparisons with other VLMs

With the innovative dataset generation described above, Florence-2 is able to achieve remarkable results across various tasks. Compared to PaliGemma, the focus of Florence-2 is less on adaptability to a variety of tasks through further training and fine-tuning, and more on its striking zero-shot capability: it demonstrates results comparable to other foundation models at a significantly smaller model size than other vision language models. The two primary architectures detailed have 232 million parameters (base) and 771 million parameters (large).

At the image level, the zero-shot performance of the large model comes close to, or reaches, state-of-the-art benchmark results with a small fraction of the parameters used by competing models. It additionally achieves competitive performance on pixel-level and region-level tasks without any task-specific training.

Overall, the strong zero-shot performance of Florence-2 stems largely from the large, well-designed multi-modal, multi-task dataset it leverages. This also highlights the true difficulty of training a robust visual language model: data quality, which can be hard to obtain for smaller organizations, heavily influences the model’s success. Additionally, the significantly reduced parameter count suggests that joint comprehension of image and text depends less on architectural scale and more on how well the parameters are fit to high-quality training data.

Transfer Learning and Fine-Tuning Process and Results

For multi-task transfer learning, the models are pre-trained on a subset of 72 million samples from the original dataset. They are subsequently transferred to a combined dataset spanning varying combinations of tasks; the best-performing combination across all tasks is the one pre-trained on image-, region-, and pixel-level tasks. It performs slightly worse than models trained only on image- and region-level tasks for tasks that do not require segmentation, but is superior on pixel-level tasks.

In the table below, we compare the results of PaliGemma checkpoints specifically fine-tuned for downstream tasks versus Florence-2, which was trained on multiple tasks at once.

As we can see, Florence-2, even when trained on multiple tasks at once, remains competitive with PaliGemma despite being roughly 3 to 10 times smaller, even when PaliGemma is fine-tuned for a single task, for image-level or language-based outputs such as visual question answering and image captioning. For region-level and pixel-level tasks, however, Florence-2 shows a significant improvement. This points to an advantage in its architecture and training setup: the visual encoder can feasibly be unfrozen for tuning, and the heavier focus on region- and pixel-level information during pre-training yields a more robust visual understanding. PaliGemma’s advantage in natural language generation tasks may come from its larger model size, the information retained from a well-pre-trained large language model like Gemma, and the architectural funneling of all token information towards the language space, which likely gives it a stronger language capability overall than Florence-2.

The authors also examine the effects of scaling model size and dataset size. Although the range of model sizes tested was limited, larger models still clearly outperformed smaller ones across all tasks. For pre-training dataset size, performance followed a clear logarithmic pattern, with exponential increases in data size yielding diminishing performance gains.

Finally, the authors measure the effects of freezing or unfreezing the vision encoder during training. They find that the vision encoder can be left frozen for image-level tasks, but for region-level or pixel-level understanding, unfreezing it becomes essential for learning the corresponding computer vision tasks. Additionally, initializing with pre-trained language weights helps tasks with natural-language outputs, but does little for purely vision-based tasks.

Below, we’ve attached an example notebook for fine-tuning Florence-2 with the encoders frozen in order to reduce memory usage for the task of visual question answering. Readers can modify this notebook to experiment with their own tasks.
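
As a starting point, the snippet below sketches how the vision encoder might be frozen before fine-tuning. The attribute name vision_tower follows the current Hugging Face Florence-2 remote code, so verify it against the checkpoint you load.

from transformers import AutoModelForCausalLM

# Load the fine-tuned base checkpoint (remote code supplies the Florence-2 model class).
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

# Freeze the DaViT vision encoder so only the encoder-decoder language model is updated,
# reducing memory usage during fine-tuning.
for param in model.vision_tower.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")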

What’s Next?

If you have questions, feel free to join our Community Slack to post them, or contact us to fine-tune your own visual language model on Datature Nexus. You can try out visual language model training for object detection on Nexus with our support for PaliGemma fine-tuning.

For more detailed information about the model functionality, customization options, or answers to any common questions you might have, read more on our Developer Portal.

Developer’s Roadmap

Datature recognizes the importance of multimodal models in practical use cases. With this in mind, we have incorporated the PaliGemma architecture for fine-tuning and will be adding more model architectures, so that Datature Nexus users will be able to import and annotate multimodal datasets and use them to train a model like Florence-2 for their specific use cases. We also note that while achieving academic benchmarks is significant, practical deployment still requires other considerations, such as guardrails to ensure consistency and quality.
