What is Dataset Exploration?
As machine learning has progressed rapidly in recent years, researchers have turned towards data-centric approaches to improve model performance. As such, understanding the data used for model training is essential for successful use cases. Even if the initial set of data is analyzed at the start of a project, in most practical settings the MLOps process of iteratively collecting and curating data, annotating it, training models, and redeploying to production is unavoidable if high levels of accuracy are to be maintained over extended periods. Intuitive and useful tools are needed to explore and manage datasets at any point in the project.
Search tools are one important way of filtering or simply exploring data. They can be used to look up assets directly by unique IDs such as filenames, or by ancillary metadata describing important aspects of the data surrounding an image, such as the number of annotations or the size of the image.
Additionally, search can be done on the image data itself, albeit in non-traditional ways. Analyzing and exploring structured data is a well-understood space, with established industry practices ranging from basic statistical tests and visualization techniques to more advanced methods such as random forests or K-means clustering. Unstructured data, by virtue of its name, cannot be manipulated in the same ways. Visual data such as images and video fall squarely into this category: images come in varying dimensions, so pixel data sizes are inconsistent, and the pixel values themselves are easily influenced by factors like contrast or saturation, introducing enough noise that typical statistical algorithms cannot extract meaningful information about the dataset as a whole. More innovative approaches are therefore needed to convert visual data into uniform, easily analyzable representations that capture its most salient visual features.
Designing Effective Search
What is Image Similarity Search?
Image similarity search is an asset management tool that allows users to search through an image dataset for similar or dissimilar images using a query image. The recent development of foundational computer vision models has aligned well with the task of image retrieval, due to their use of embedding architectures to construct uniformly sized vectors from images. Transformer architectures such as CLIP (developed by OpenAI) and DINOv2 (developed by Meta Research) have seen the most success in both academic benchmarks and practical settings. This is due to their inherent generalizability and ability to capture visual and semantic details within compact vectors, effectively transforming millions of pixels into a list of fewer than a thousand numbers.
General Approach
Each asset in a dataset has a corresponding embedding vector generated for it, and these vectors become the basis of comparison. When a user selects a query image and requests the most similar or dissimilar images, the query's vector is compared against every other vector in the dataset, and the dataset is sorted so that images corresponding to the closest (or farthest) vectors are rendered in that order.
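As a rough sketch of this flow (a minimal example assuming the embedding vectors have already been computed and stacked as rows of a NumPy array, not the exact pipeline running in production):

```python
import numpy as np

def rank_by_similarity(query_vec: np.ndarray, dataset_vecs: np.ndarray) -> np.ndarray:
    """Return dataset indices ordered from most to least similar to the query.

    query_vec:    (d,)   embedding of the query image
    dataset_vecs: (n, d) embeddings of every asset in the dataset
    """
    # Normalize so that a dot product equals cosine similarity
    query = query_vec / np.linalg.norm(query_vec)
    dataset = dataset_vecs / np.linalg.norm(dataset_vecs, axis=1, keepdims=True)

    scores = dataset @ query        # cosine similarity of each asset to the query
    return np.argsort(-scores)      # highest similarity first; flip the sign for "most dissimilar"
```

Comparing the query against every vector exhaustively does not scale to very large datasets, which is why approximate nearest-neighbour indexes are typically built over the embeddings beforehand.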
Image Embedding Architectures
The two architectures that we considered were CLIP and DINOv2, which are both vision transformer architectures but are trained on different datasets and with different methods. As these are the two state-of-the-art models in the space, we compare them on several factors relevant to our use case.
Architectural Design
Both are largely built around a vision transformer encoder with multiple heads that can be adapted to several tasks. The CLIP paper also includes experiments with smaller architectures such as ResNet-50. Both families offer models of several sizes, from ViT-S and ViT-B up to ViT-g, to accommodate various scales and levels of detail.
The DINOv2 and CLIP loss functions also differ. DINOv2 uses image-level and patch-level objectives with the weights untied, and applies Sinkhorn-Knopp centering as well as a KoLeo regularizer to spread the features more uniformly across the embedding space, making them more distinct from each other and allowing for more granularity.
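For intuition, a minimal sketch of a KoLeo-style regularizer (based on the Kozachenko-Leonenko differential entropy estimator described in the DINOv2 paper; the exact details in the released code may differ) could look like this:

```python
import torch

def koleo_regularizer(embeddings: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of a KoLeo-style penalty: -mean(log distance to nearest neighbour).

    embeddings: (n, d) batch of feature vectors.
    """
    # Pairwise Euclidean distances, with self-distances masked out
    dists = torch.cdist(embeddings, embeddings)
    dists.fill_diagonal_(float("inf"))
    nearest = dists.min(dim=1).values
    # Penalizing small nearest-neighbour distances pushes the features to spread apart
    return -torch.log(nearest + eps).mean()
```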
CLIP's loss function focuses on maximizing the cosine similarity between matched text and image pairs while minimizing the cosine similarity between mismatched pairs. The cosine similarity is calculated from embeddings produced by separate image and text encoders.
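The symmetric contrastive objective from the CLIP paper can be sketched roughly as follows (assuming the image and text encoders already output L2-normalized embeddings, and using a fixed temperature for simplicity, whereas CLIP learns it during training):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched image/text pairs.

    image_embeds, text_embeds: (n, d), L2-normalized; row i of each is a matched pair.
    """
    # Cosine similarity of every image with every caption in the batch
    logits = image_embeds @ text_embeds.T / temperature
    targets = torch.arange(logits.shape[0])            # matching pairs sit on the diagonal
    loss_images = F.cross_entropy(logits, targets)     # image -> correct caption
    loss_texts = F.cross_entropy(logits.T, targets)    # caption -> correct image
    return (loss_images + loss_texts) / 2
```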
Training Methods
CLIP is trained on a dataset of image-caption pairs, making it fully supervised, whereas DINOv2 is trained in a self-supervised setting on an incredibly large dataset. DINOv2 also utilizes knowledge distillation from its larger models to achieve stronger performance with fewer parameters, condensing foundation-model performance into more accommodating model sizes.
Based on the above criteria, we determined that DINOv2 is more capable of storing important visual details across a broader range of data, without being swayed by the semantic understanding that CLIP attempts to encode. As such, we use DINOv2 as our underlying image embedding model for search.
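As an illustration of how such embeddings can be generated (a minimal sketch using the publicly released DINOv2 weights from torch.hub, not necessarily the exact preprocessing or model size used on Nexus):

```python
import torch
from PIL import Image
from torchvision import transforms

# Load a small publicly released DINOv2 backbone (ViT-S/14) from torch.hub
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed_image(path: str) -> torch.Tensor:
    """Return a single fixed-length embedding vector for one image."""
    image = Image.open(path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)   # (1, 3, 224, 224)
    with torch.no_grad():
        features = model(batch)              # (1, 384) global embedding for ViT-S/14
    return features.squeeze(0)
```

Each asset only needs to be embedded once; the resulting vectors can then be stored and reused for every subsequent search.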
Choice of Similarity Metrics
There are a wide variety of distance metrics that can be used to quantify how similar two vectors are. Common metrics include the L2 or Euclidean distance and cosine similarity. While Euclidean distance is the more prototypical metric for a broad class of tasks, embedding vectors such as those generated by CLIP and DINOv2 are incentivized in their ambient space to be closer in terms of cosine similarity when their features are similar. As such, our search also relies on cosine similarity calculations to order the images during a search.
When users search for the most similar images, images are ranked by their cosine similarity scores from highest to lowest:

cosine similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

Cosine similarity is calculated as above, where A and B are arbitrary embedding vectors. The resulting value ranges from -1 to 1: intuitively, -1 means the embedding vectors are opposite in concept, 0 means they have no relation, and 1 means they are exactly the same.
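A quick worked example of those three cases with toy two-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
print(cosine_similarity(a, a))                      #  1.0 -> identical
print(cosine_similarity(a, np.array([0.0, 1.0])))   #  0.0 -> unrelated (orthogonal)
print(cosine_similarity(a, -a))                     # -1.0 -> opposite
```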
What is Metadata Query?
The Datature Metadata Query is a built-in search engine on the Nexus platform that allows teams to better understand their data. It is useful at any stage of the computer vision pipeline, whether as a tool for data discovery after annotation is complete or as a way to identify edge cases that can impact model training and performance.
Metadata query goes beyond simply searching for values by supporting intuitive logical conditional statements as searchable queries. For instance, one can use relational operators such as <, >, <=, >=, ==, and %. Additionally, multiple conditional statements can be chained together using logical operators such as ‘and’ and ‘or’. The fields available for query are the number of instances of a specific tag, the file size, image height and width, filename, asset group, and annotation status. Given how much freedom and expressivity these queries offer, one can construct anything from simple filename searches to filters based on annotation completeness, number of annotations, and image height and width, all in one request. Here is an example of one such query:
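The field names and grouping syntax shown here are assumptions for illustration; refer to the metadata query documentation for the precise syntax:

```
(squirrels > 2 or height > 200) and group == "dataset2"
```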
This query looks for images which contain more than 2 annotations of squirrels or have an image height of more than 200 pixels, and which belong to the asset group “dataset2”. Click here to learn more about the specific details of metadata query.
These queries can be used not just to blindly restrict and filter data, but also to spot outliers using your industry- or application-specific knowledge. For example, if you know that your dataset is full of bicycles with labeled wheels, ‘wheels % 2 == 1’ will surface images in which a bicycle may be incomplete, or poor-quality images that should be checked for validity. This allows for a more intentional search of your dataset.
How Advanced Search Tools Work on Nexus
Dataset management on Datature’s Nexus is centralized in the Datasets page, which can be reached from the sidebar on the project homepage. Once there, you will be able to manually navigate to assets using the search bar.
This is where the Metadata Query can be used for searching with the various queries described above or with a filename. The filtered data based on the query will be rendered below.
For asset similarity search, once the embeddings for each image have been calculated and the search indexes have been built, you will be able to query with any image in your dataset. A purple icon on the upper left, just below the asset upload box at the top, indicates that this process is complete, while a grey icon shows that embeddings are still being calculated. For any individual image, you can open the extra options on its bottom right and select Similarity Search. The Datasets page will then be reordered with the closest assets in terms of image features. This can help with identifying other significant images that you would like annotated, or spotting images that are anomalous compared to those you typically receive. To see more technical details about similarity search, you can take a look at our documentation here. As shown below, the dropdown button next to an image allows for Similarity Search.
The list of the most closely related assets, based on the structure and content of the image, is then rendered. As you can see below, with a query using the image of a horse jockey and a horse, the top 100 most similar assets largely consist of other images with horse jockeys and horses, or people riding horses.
With the two different tools of metadata query and similarity search, users are able to tackle a broad range of asset management and dataset exploration tasks. Metadata query uses the tabular metadata provided by the user to precisely find the images that match explicit criteria, so its capabilities grow with the amount of metadata a user provides in the form of annotations or asset groups. For instance, if one is looking for assets that contain a specific combination of annotations already made, metadata query is a great way to easily navigate the dataset or discover whether any such images exist. In a practical example, a user might want to check whether any images were improperly preprocessed by searching for images whose height or width is too large.
In cases where users want to leverage the more intangible visual attributes within the image data itself to dig deeper into their dataset, similarity search is the more appropriate tool. It targets use cases such as checking whether other similar images exist when an anomalous image appears in a large, recently ingested dataset that there hasn’t been enough time or resources to work through manually. For example, in an MLOps workflow that leverages active learning, when anomalous images from production are re-uploaded to the platform, one can use similarity search to see whether any other images in the dataset are visually anomalous in the same way and simply were not identified earlier.
What’s Next?
If you want to try out any of the features described above, please feel free to sign up for an account and navigate to the Datasets page to use our search tools and explore our other asset management tools such as asset bulk actions and asset group management.
Our Developer’s Roadmap
Providing capabilities for exploring large datasets is an ever-evolving process, and Datature is committed to enabling users to better understand their datasets in a variety of ways. We will continue to leverage these embeddings for features such as reduced-dimension visualization to make data exploration even easier and more intuitive. Additionally, to better integrate with our other filtering tools in the Annotator and the Workflow creator on Nexus, you will be able to add images found through similarity search to new or pre-existing asset groups.
Want to Get Started?
If you have questions, feel free to join our Community Slack to post your questions or contact us about how asset similarity search fits in with your usage.
For more detailed information about the asset similarity search functionality, customization options, or answers to any common questions you might have, read more about the process on our Developer Portal.