Improving Datasets & ML Models with Metadata Query

Keechin

December 16, 2021

What is Datature Metadata Query?

Datature Metadata Query is a built-in search engine on the Nexus platform that helps teams better understand their data. To improve CV models, teams must first understand where they fail. Metadata Query fits naturally into our existing solutions here, helping ML teams adopt a data-centric approach to AI development. Querying your data is useful at any step of the development process, from performing labeler QA and sanity checks to identifying edge cases that cause model confusion after visualizing neural network inferences with Portal.

The Need for Metadata Query

Based on our experience, the limiting factor of model performance in real-life scenarios is often the dataset and not the model. Teams usually obsess over model architectures and hyperparameters and treat their data as secondary, often placing blind faith in benchmark datasets that are themselves prone to annotation errors. Teams that actively check through thousands of images as part of their QA effort are not only checking the tightness of the bounding boxes or how pixel-accurate the polygons are. Most of the time, they are checking the correctness and sanity of labels that may have been created by different team members or an external labeling company, which is, needless to say, an extremely tedious process.

*Error Analysis of the Open Images v4 test set. 36% of the "false positive" predictions by the model were actually due to annotation error!* *Source*.

With Metadata Query, teams can run their QA processes 10x faster, so that even small teams can run sanity checks on their data before and after training. We believe this lets teams improve production AI by moving past aggregate metrics and placing an emphasis on understanding their data in depth. In our eyes, once teams understand their qualitative failure modes, they can fix them by gathering the right data to handle edge cases and by debugging poor model performance, spending less time sifting through data and more time on model iteration.

Metadata Query & Data-Centric AI

With the advent of Data-centric AI, we see Metadata Query as another tool that fits within the iterative workflow of training a model, validating how the model performs (using model visualization tools like Portal), and then conducting error analysis to determine whether more data of a certain class is required or whether labels need to be corrected.

*Data-Centric Development on the Nexus and* *Portal* *Platform. Source: Author*

Metadata Query is especially powerful when combined with tools like Portal, which allows you to spot classes that are not being detected above a certain confidence threshold. Before jumping back to the training workflow and experimenting with different model architectures or hyperparameters, it is well worth inspecting the dataset first. By digging deeper into how a certain class of objects has been labelled, you may uncover labeling errors, biases, or edge cases that require more data to be collected. For example, in the context of a steel defect detection model, perhaps your labelers were uncertain about what counts as a "defect". This can cause some defective samples to be predicted as "Not Defective", increasing the number of false negatives (where positive = defective, negative = not defective). By understanding the underlying reason for model failure, your team can decide on the next course of action to improve model performance.

*Model-Centric vs Data-Centric AI approach to improving model performance.* *Source*

With Metadata Query, you can:

Find the number of images that are labelled with a certain class

Teams can use this as a preliminary sanity check before model training to identify potential data imbalances in the underlying dataset. For example, in this facemask dataset, querying 'mask worn incorrectly' returns 94 out of 833 images, a clear sign of class imbalance that may bias the trained model.

*Sample of Metadata Query: "mask worn incorrectly"*

*Querying all Images that are Labeled with "mask worn incorrectly"*
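The same kind of imbalance check can be approximated offline. The following is a minimal Python sketch, not the Nexus implementation: the `annotations` dictionary, its file names, and the helper `images_per_class` are all hypothetical stand-ins for whatever your annotation export contains.

```python
from collections import Counter

# Hypothetical per-image label lists; in practice these would be
# parsed from your own annotation export.
annotations = {
    "img_001.jpg": ["with_mask", "with_mask"],
    "img_002.jpg": ["mask worn incorrectly", "with_mask"],
    "img_003.jpg": ["without_mask"],
    "img_004.jpg": ["with_mask"],
}


def images_per_class(annotations):
    """Count how many images contain at least one label of each class."""
    counts = Counter()
    for labels in annotations.values():
        counts.update(set(labels))  # set() so an image counts once per class
    return counts


counts = images_per_class(annotations)
total = len(annotations)
for name, n in counts.most_common():
    print(f"{name}: {n}/{total} images ({n / total:.0%})")
```

Counting images rather than individual labels mirrors what the query returns: a list of matching images, regardless of how many times the class appears in each one.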

Find the number of images that meet a specified quantity

With Metadata Query, teams can identify images that contain a specified number of objects by combining comparisons with operators such as 'and' and 'or'. In the example below, we search for all images where the number of RBC labels and the number of WBC labels are each greater than or equal to 4, or where the number of Platelets labels is less than or equal to 3.

*Sample of Metadata Query with Operators*

*Querying all Images that Contain the Following: "RBC >= 4 and WBC >= 4 or Platelets <= 3"*
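To see how such a combined condition selects images, here is a small Python sketch over hypothetical per-image label counts. It assumes the query groups the way Python does, with 'and' evaluated before 'or', which matches the description above but is an assumption about the query engine rather than something we have verified here.

```python
# Hypothetical per-image label counts (class name -> number of annotations).
images = {
    "a.png": {"RBC": 6, "WBC": 4, "Platelets": 5},
    "b.png": {"RBC": 2, "WBC": 1, "Platelets": 2},
    "c.png": {"RBC": 5, "WBC": 1, "Platelets": 7},
}


def matches(counts):
    # Python evaluates `and` before `or`, so this reads as
    # (RBC >= 4 and WBC >= 4) or (Platelets <= 3).
    return (
        counts.get("RBC", 0) >= 4
        and counts.get("WBC", 0) >= 4
        or counts.get("Platelets", 0) <= 3
    )


selected = [name for name, counts in images.items() if matches(counts)]
print(selected)
```

Here `a.png` matches through the first half of the condition and `b.png` through the second, while `c.png` satisfies neither.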

Ensuring all images are in the appropriate dimension / orientation as a preprocessing sanity check

A highly requested feature was the ability to filter large amounts of data as part of the curation process, for instance to filter out images that are not in the appropriate orientation or resolution. For example, to find all images where the width is less than the height and the file size is below a certain threshold, we can use a simple division operator.

*Sample of Metadata Query to Ensure Dimension Consistency*

*Querying all images that are in the appropriate dimensions: "width / height < 1 and fileSize <= 100,000"*
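For teams who want to reproduce this check in a preprocessing script, a rough Python equivalent is sketched below. The `Asset` class and the sample values are hypothetical, and we assume `fileSize` is measured in bytes, which the query itself does not state.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    fileName: str
    width: int
    height: int
    fileSize: int  # assumed to be in bytes


# Hypothetical assets for illustration only.
assets = [
    Asset("portrait_small.jpg", 600, 800, 90_000),
    Asset("landscape.jpg", 1280, 720, 80_000),
    Asset("portrait_large.jpg", 1080, 1920, 450_000),
]


def is_small_portrait(asset, max_size=100_000):
    """Mirror of the query: width / height < 1 and fileSize <= max_size."""
    return asset.width / asset.height < 1 and asset.fileSize <= max_size


portrait_names = [a.fileName for a in assets if is_small_portrait(a)]
print(portrait_names)
```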

Finding any mislabeled images that do not make logical sense

This is ideal for teams training a model to recognize objects that come in a fixed number. For example, a bicycle must have 2 tires, so we want to find images that have an odd number of bicycle tire labels. With Metadata Query, this can be done using the modulo operator "%", which in this case returns a single image. Selecting the image lets us spot the annotation error: both tires were labeled with a single bounding box.

*Querying Images that Have an "Odd" Number of Bicycle Tires Using "Tyres % 2"*
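The logic behind the modulo check is short enough to sketch in Python. The file names and counts below are hypothetical; the point is only that a non-zero remainder after dividing by 2 flags an odd count.

```python
# Hypothetical number of "Tyres" labels per image.
tyre_counts = {
    "bike_01.jpg": 2,
    "bike_02.jpg": 1,  # e.g. both tyres covered by one bounding box
    "bike_03.jpg": 4,
    "bike_04.jpg": 0,
}

# n % 2 is 1 for odd counts and 0 for even counts, so it doubles as a filter.
odd_images = [name for name, n in tyre_counts.items() if n % 2]
print(odd_images)
```

Note that an image with no tyre labels at all has a remainder of 0 and is not flagged, so missing labels need a separate check.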

Finding images based on filename

Based on our experience, organizing and curating a dataset for model training is often manual and tedious. Teams can now find relevant assets based on their filenames, which often contain information such as the collection date, time, and location.

*Querying images that contain a "20" in their filenames*

Finding edge cases that may not make logical sense

*Querying Images that Contain More WBC than RBC via: "WBC > RBC"*

In some scenarios, there will be natural class imbalances, such as the number of red blood cells (RBC) being greater than the number of white blood cells (WBC). With Metadata Query, these sanity checks are easy to perform by comparing the label count of one class with that of another in any given asset, allowing teams to dig deeper into inconsistencies and correct mislabeled data.

How to Use Metadata Query

Metadata Query is essentially a simple filter expression in which multiple properties (static and dynamic) and operators can be combined to form a complex logical expression. Static properties, in this case, refer to an asset's (i) fileSize, (ii) fileName, (iii) height, and (iv) width. Dynamic properties refer to the class names that you have defined for your tags during the data labeling process. Hence, any changes to class label names will affect how you query your data. A full list of properties and operators can be found below.

To get started querying your data, ensure that you have the following:

Nexus Account - If not, sign up for a free one here!
An Image Dataset
Labels for the Dataset - COCO, YOLO, Pascal VOC, etc.

Upon clicking on "Images" in the sidebar, you will be able to see your image thumbnails along with the search bar for you to submit your metadata queries. Tip: You can filter and search your dataset via filename as well - especially useful for teams who have an internal naming system!

Each asset has the following properties:

Tag Names (case sensitive)
fileSize
fileName
height
width
instances (instances here refer to the total number of annotations on a particular image)
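To make the property list above more concrete, here is a minimal Python sketch of how a comparable per-asset record could be assembled offline from a COCO-style annotation export. This is not how Nexus builds its index internally; the function name `build_record` and the sample data are hypothetical, and `fileSize` is left out because COCO files do not store it.

```python
from collections import Counter


def build_record(image, annotations, categories):
    """Merge static file properties with per-class label counts for one image."""
    names = {c["id"]: c["name"] for c in categories}
    counts = Counter(
        names[a["category_id"]]
        for a in annotations
        if a["image_id"] == image["id"]
    )
    record = {
        "fileName": image["file_name"],
        "width": image["width"],
        "height": image["height"],
        "instances": sum(counts.values()),  # total annotations on the image
    }
    record.update(counts)  # dynamic properties: one entry per tag name
    return record


# Hypothetical COCO-style data for a single image.
coco = {
    "images": [{"id": 1, "file_name": "smear_01.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "RBC"}, {"id": 2, "name": "WBC"}],
    "annotations": [
        {"image_id": 1, "category_id": 1},
        {"image_id": 1, "category_id": 1},
        {"image_id": 1, "category_id": 2},
    ],
}

record = build_record(coco["images"][0], coco["annotations"], coco["categories"])
print(record)
```

Because the tag names become keys in the record, renaming a class changes which keys exist, which is why label name changes affect your queries.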

The supported operators are as follows:

Operator	Example	Explanation
and	RBC and WBC	Returns all images that are labeled with "RBC" AND "WBC"
or	RBC or WBC	Returns all images that are labeled with either "RBC" OR "WBC"
Addition (x + y)	RBC + WBC >= 5	Returns all images where the sum of "RBC" and "WBC" labels is greater than or equal to 5
Subtraction (x - y)	RBC - WBC <= 3	Returns all images where the number of "RBC" labels minus the number of "WBC" labels is less than or equal to 3
Division (x / y)	RBC / WBC == 2	Returns all images where the number of "RBC" labels divided by the number of "WBC" labels is equal to 2
Modulo (x % y)	RBC % 2	Returns all images where there is an odd number of "RBC" labels
Power (x ^ y)	RBC ^ 2 <= 16	Returns all images where the number of "RBC" labels raised to the power of 2 is less than or equal to 16

The supported comparisons are as follows:

Comparison	Example	Explanation
Equals (x == y)	RBC == instances	Returns all images where the number of "RBC" labels is equal to the total number of labels in the image, i.e. "Show me all images where there are only RBC (red blood cells)"
Does not equal (x != y)	RBC != WBC	Returns all images where the number of "RBC" labels is NOT equal to the number of "WBC" labels
Less than ( < )	RBC < 3	Returns all images where the number of "RBC" labels is less than 3
Less than or equal to ( <= )	RBC - WBC <= 3	Returns all images where the number of "RBC" labels minus the number of "WBC" labels is less than or equal to 3
More than ( > )	RBC / WBC > 2	Returns all images where the number of "RBC" labels divided by the number of "WBC" labels is more than 2
More than or equal to ( >= )	RBC + WBC >= 3	Returns all images where the sum of "RBC" and "WBC" labels is greater than or equal to 3
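For readers who want to experiment with these expressions outside the platform, the sketch below evaluates a query string against one image's properties using Python's `ast` module. It is our own toy evaluator, not the Nexus query engine: the function name `evaluate` is hypothetical, and it assumes Python's operator precedence, treats a missing class as zero labels, treats division by zero as "no match", and cannot handle tag names containing spaces.

```python
import ast
import operator

# Arithmetic and comparison operators from the tables, mapped to Python functions.
_BIN = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Div: operator.truediv,
        ast.Mod: operator.mod, ast.Pow: operator.pow}
_CMP = {ast.Eq: operator.eq, ast.NotEq: operator.ne, ast.Lt: operator.lt,
        ast.LtE: operator.le, ast.Gt: operator.gt, ast.GtE: operator.ge}


def evaluate(query, props):
    """Evaluate a query string against one image's properties (name -> number)."""
    # `^` means power in the query syntax but XOR in Python, so rewrite it first.
    tree = ast.parse(query.replace("^", "**"), mode="eval")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BoolOp):
            values = [bool(walk(v)) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN:
            return _BIN[type(node.op)](walk(node.left), walk(node.right))
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in _CMP):
            return _CMP[type(node.ops[0])](walk(node.left), walk(node.comparators[0]))
        if isinstance(node, ast.Name):
            return props.get(node.id, 0)  # a missing class counts as zero labels
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    try:
        return bool(walk(tree))
    except ZeroDivisionError:
        return False  # assumption: dividing by a zero count never matches


# Hypothetical image with 4 RBC labels and 2 WBC labels.
props = {"RBC": 4, "WBC": 2, "instances": 6}
for q in ["RBC / WBC == 2", "RBC ^ 2 <= 16", "RBC % 2", "RBC == instances"]:
    print(q, "->", evaluate(q, props))
```

Walking the parsed tree with an explicit whitelist, rather than calling `eval`, keeps the sketch limited to the operators and comparisons listed in the tables.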

What's Next?

With Metadata Query, how you slice and dice your data is all up to you and the possibilities are endless! Training accurate models requires a deep understanding of your training data. By visualizing their dataset, teams can spot flaws or imbalances in ground truth labels and identify potential edge cases, which will guide model development. We have a roadmap for increasing the robustness of the query engine to support the identification of edge cases not performing above a set confidence threshold, as well as the ability to train a new model based on a subset of data.

Want to get started?

Metadata Query is now available across all plans (starter as well), so all you have to do is to sign up for a free account on the Nexus platform and kickstart your first project. If you have more questions, feel free to join our Community Slack to post your questions.
