There’s an interesting give and take with machine learning (ML) models.
Just as humans increasingly rely on them, they rely on us. While they can dramatically speed up and hone human processes, they also need to be fed the correct information – by humans – to do that job.
“They don’t have common sense, they only learn from what you tell them,” said Eric Landau, cofounder and CTO of London computer vision company Encord, which claims to have developed the first-ever tool to help address the fundamental, growing and time-consuming problem of unlabeled data.
The inherent quandary is that there’s no shortage of data in the world – it accumulates by the day, hour and minute. However, much of that data remains unlabeled and is thus unusable by ML models. Humans must often do the labeling, and natural human distraction leads to errors that then have to be corrected, resulting in double work.
“It’s relying on human judgment to correct other errors in human judgment,” Landau said. “If you aren’t careful about feeding a model the exact right annotations, this will have negative consequences in the real world.”
The urgent need for data quality
Today, Encord announced the general release of its data quality assessment technology, which automatically detects errors within annotated training data. This can help make AI development less expensive, less time-consuming and easier to scale, Landau said.
“Model building can be a slow, arduous process,” he said. “Data quality is an urgent need for machine learning teams. We wanted to speed up that process, to make it easier for teams to build models a lot quicker.”
The technology uses micro models: small neural networks finely targeted at a single task, which vet the large volumes of data that bigger models train on. It is agnostic to use case, so users can feed in whatever their essential data may be. “These are small, targeted models that are good at one thing, not very general,” Landau said.
For example, for dashcams that detect road signs, several micro models can be strung together, each individually understanding signs for, say, certain U.S. states or European cities.
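Encord hasn’t published its architecture, but the idea can be illustrated with a toy sketch: each micro model handles one narrow domain, and a simple router dispatches inputs to the right one. Every name below (route_prediction, us_sign_model and so on), along with the routing scheme itself, is a hypothetical stand-in rather than the company’s actual code.

```python
from typing import Callable, Dict, Tuple

# A "micro model" here is just a callable mapping an input image to a
# (label, confidence) pair; in practice each would be a small trained network.
MicroModel = Callable[[object], Tuple[str, float]]

def us_sign_model(image: object) -> Tuple[str, float]:
    return ("stop_sign", 0.97)   # dummy output for illustration

def eu_sign_model(image: object) -> Tuple[str, float]:
    return ("yield_sign", 0.91)  # dummy output for illustration

def route_prediction(image: object, region: str,
                     models: Dict[str, MicroModel]) -> Tuple[str, float]:
    """Dispatch an image to the micro model registered for its region."""
    if region not in models:
        raise KeyError(f"no micro model registered for region '{region}'")
    return models[region](image)

# "Stringing together" micro models: one narrow, specialized model per region.
models: Dict[str, MicroModel] = {"us": us_sign_model, "eu": eu_sign_model}
print(route_prediction(image=None, region="us", models=models))  # ('stop_sign', 0.97)
```

The appeal of this design is that each model stays cheap to train and easy to validate, since it only ever has to be good at one thing.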
The tool also applies the growing technique of self-supervised learning. Only the “most egregious” cases are passed back to human eyes, which keeps the system robust while optimizing human time.
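Encord hasn’t described the internals of this review loop, but a plausible minimal version is sketched below: a micro model re-predicts each annotated example, and only high-confidence disagreements between model and annotator – the “egregious” cases – are escalated to a human reviewer. The flag_suspect_labels function and the 0.95 threshold are assumptions for illustration, not the product’s API.

```python
from typing import List, Tuple

def flag_suspect_labels(
    annotations: List[Tuple[str, str]],    # (example_id, human_label)
    predictions: List[Tuple[str, float]],  # (model_label, model_confidence)
    threshold: float = 0.95,
) -> List[str]:
    """Return example IDs where a confident model disagrees with the annotator."""
    suspects = []
    for (example_id, human_label), (model_label, confidence) in zip(annotations, predictions):
        # Only the most "egregious" cases, i.e. high-confidence disagreements,
        # are escalated to a human reviewer; everything else passes through.
        if model_label != human_label and confidence >= threshold:
            suspects.append(example_id)
    return suspects

# Example: one confident disagreement gets flagged for human review.
annotations = [("img_001", "polyp"), ("img_002", "normal")]
predictions = [("polyp", 0.88), ("polyp", 0.99)]
print(flag_suspect_labels(annotations, predictions))  # -> ['img_002']
```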
The technology is being used by specialist computer vision companies including Teton.AI and SurgEase, as well as healthcare institutions King’s College London and St Thomas’s Hospital. Landau said he sees ML use cases across a variety of areas, from satellite imaging to radiology.
“It doesn’t require humans to review every single data point. It’s a general approach to the problem, but it’s extremely important,” he said. “As far as we know, it is the first-of-its-kind automated label quality assessment tool for computer vision.”
A data-centric approach
Founded in 2020, Encord is backed by CRV, Y Combinator, WndrCo and Crane Venture Partners, and in May 2022 it was named to CB Insights’ AI 100 list of the most innovative artificial intelligence startups.
Still, several other much larger companies are tackling the same data labeling issues, including Scale AI and Snorkel. But Encord is clearly on the upswing: Its tools have been used by King’s College London, Memorial Sloan Kettering Cancer Center and Stanford Medical Center to help process 3x more images and reduce experiment duration by 80%, Landau said.
The company has helped hospitals annotate videos of pre-cancerous polyps, increasing efficiency by an average of 6.4x. It has automated 97% of labels, helping clinicians become 16x more efficient at labeling medical images. It has even loftier plans to accelerate medical research by 100x, according to Landau.
He emphasized the importance of relying on the data, rather than the model – a growing concept in AI. The longstanding practice has been the “model-centric” approach, but focusing on the data, he said, is “more acute.”
“What you’re feeding the model is the most important thing,” he said. “The quality of the model is the quality of the data.”
That’s because if you think only about models and how to fix them, you lose perspective on the issue you’re trying to solve, he pointed out. When data is improperly annotated, models learn the wrong thing and people can get hurt: a polyp might be overlooked in a gastroenterology video, for example, or a model might fail to recognize that a patient in an elderly care home has fallen – to say nothing of the many well-known issues with autonomous vehicles.
Particularly in building medical diagnostic AI systems, scientists require training data from all types of demographics – ages, nationalities, characteristics. “You can only do that with data-centric models,” Landau said. “If you just think about the model, not the data, you will lose that.”
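As a concrete illustration of that data-centric mindset, a minimal coverage audit might look like the sketch below: before training, count how each demographic group is represented in the labeled records and warn when any group falls below a chosen share. The field name age_band, the audit_coverage helper and the thresholds are illustrative assumptions, not part of any specific tool.

```python
from collections import Counter
from typing import Dict, List

def audit_coverage(records: List[Dict[str, str]], field: str,
                   min_share: float = 0.05) -> Dict[str, float]:
    """Return each group's share of the dataset, warning on underrepresented ones."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    shares = {group: n / total for group, n in counts.items()}
    for group, share in shares.items():
        if share < min_share:
            # An underrepresented group is a data problem, not a model problem.
            print(f"warning: group '{group}' is only {share:.1%} of the data")
    return shares

# Example: two of three age bands fall below the 30% floor and get flagged.
records = [{"age_band": "18-40"}, {"age_band": "18-40"},
           {"age_band": "41-65"}, {"age_band": "65+"}]
print(audit_coverage(records, "age_band", min_share=0.3))
```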