By Bernadette Wilson, on Dec 26, 2019

What is Data Annotation & Why Does It Matter?

With artificial intelligence (AI) and machine learning (ML) adoption on the rise, data annotation workloads are skyrocketing. Compared to just a few years ago, data annotation has grown into a much larger and more time-consuming task. As a result, AI development teams are looking for ways to manage data annotation without sacrificing accuracy or quality.

What is Data Annotation?

Data annotators create metadata in the form of code snippets that describe or categorize data. Companies have used data annotation in the past to identify patterns and to make data searchable. Now, however, organizations are focusing their resources on data annotation to prepare data stacks for structured ML or training sets for unstructured ML.
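As a hypothetical illustration of that first use case, metadata that categorizes records and makes them searchable, consider simple tags attached to documents (the field names and tag values below are invented for illustration, not taken from any specific tool):

```python
# Hypothetical example: attaching descriptive metadata tags to documents
# so they can be categorized and searched. All names are illustrative.
documents = [
    {"id": 1, "title": "Q3 sales report", "tags": ["finance", "quarterly"]},
    {"id": 2, "title": "Onboarding guide", "tags": ["hr", "training"]},
    {"id": 3, "title": "Budget forecast",  "tags": ["finance", "planning"]},
]

def search_by_tag(docs, tag):
    """Return the titles of documents whose metadata includes the given tag."""
    return [d["title"] for d in docs if tag in d["tags"]]

print(search_by_tag(documents, "finance"))
```

The same pattern, a record plus a human-applied label, underlies the ML training sets discussed next; the difference is that the labels become the ground truth a model learns from.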

Adding metadata is a relatively straightforward task, but there’s much more to consider when annotating data to train a machine learning or artificial intelligence system. Your ML model will only be as accurate as its training data’s annotation.

Consider how the accuracy of data annotation could make or break these projects:

  • Text and internet search: By labeling concepts within text, ML models can learn to understand what people are searching for — not just word for word, but taking a person’s intent into account.
  • Chatbots: Data annotation can give chatbots the ability to respond appropriately to a query, whether spoken or typed.
  • Natural language processing (NLP): NLP systems can learn to understand the meaning of a query and generate intelligent responses.
  • Optical character recognition (OCR): Data annotation allows data engineers to build training sets for OCR systems that can recognize handwritten characters and convert PDFs and images to text.
  • Language translation: ML models can learn to translate spoken or written words from one language to another.
  • Autonomous vehicles: Advancing self-driving vehicle technology is a prime example of why accurately training ML systems to recognize images and interpret situations is important.
  • Medical images: Data engineers are training models to detect cancerous tissue or other abnormalities from X-ray, sonogram, or other medical images.
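To make the chatbot example above concrete, here is a minimal sketch of what annotated training data can look like: user queries paired with intent labels that a model would learn to predict. The field names and label values are illustrative assumptions, not from any particular annotation platform:

```python
# Hypothetical example: chatbot queries annotated with intent labels.
# Field names and label values are illustrative, not from any specific tool.
annotated_queries = [
    {"text": "What time do you open tomorrow?", "intent": "store_hours"},
    {"text": "I need to return a damaged item.", "intent": "return_request"},
    {"text": "Do you ship to Canada?", "intent": "shipping_info"},
]

def label_counts(examples):
    """Count how many examples carry each intent label.

    Checking label balance like this is one simple quality check
    annotators and data engineers run on a training set.
    """
    counts = {}
    for ex in examples:
        counts[ex["intent"]] = counts.get(ex["intent"], 0) + 1
    return counts

print(label_counts(annotated_queries))
```

If the `intent` values here were applied inconsistently or incorrectly, any model trained on them would inherit those mistakes, which is the point the next paragraph makes.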

If you train these systems, or any other ML system, with data that’s been labeled inaccurately, the results will be inaccurate, unreliable, and of no value to the user.

Options for Managing Data Annotation Workloads

With so much riding on the quality of data labeling, it’s risky to make it a part of your engineers’ workloads. Labeling large volumes of data could monopolize their time — or, worse, not get the attention this pivotal task deserves. Even dedicated, in-house data annotation resources may not be able to label large volumes of data in time to meet a project deadline or have the agility to manage requests to add different types of data or labeling to an ML training data set.

There are a number of options available for managing high-volume data projects, and industry leaders are optimistic about automated approaches to these tasks. The simple fact, though, is that for the foreseeable future, many of these tasks will need to be performed by hand, or at the very least with a human touch.

Outsourcing Data Annotation

These obstacles are driving the growth of outsourced data annotation. In addition to providing AI and ML systems developers with the resources they need, partnering with a third party for data annotation gives developers the added benefit of reducing internal bias. A data annotation provider outside your organization won’t have any expectations of how the model should behave and, therefore, won’t label data with a specific outcome in mind. A provider with a diverse talent pool will also reduce bias toward specific cultures, races, or other demographics. And if your project would benefit from unique skills, such as annotators who are bilingual or have expertise with a particular coding language, your outsourced data annotation partner can assign the right people from its team for the job.

A third-party data annotation provider also has the advantage of a singular focus. The team isn’t pulled in multiple directions to try to get a product to market or design a specific system by a client’s deadline. A data annotation provider’s project managers ensure that data annotation is accomplished accurately, securely, and on time.

Setting stopping or sunsetting criteria is also easier when you work with a third-party provider. As soon as you determine that further data labeling won’t provide added return, you can end the project — rather than taking an in-house resource’s schedule or continued employment into account.

According to Cognilytica, working with a third-party data annotation organization will also save you money. Its research shows that data preparation accounts for about 80 percent of the time required for most AI and ML projects, and that internal data labeling is five times more costly than the services of a third-party provider.

Cognilytica also projects that although data preparation tools themselves are adding ML to make them more efficient, there will continue to be a need for human intervention into the future. Determine the best course for your organization to accomplish this vital task and ensure your AI and ML systems deliver reliable results based on quality data annotation.

Opinions expressed by Daivergent contributors are their own.