Boosting Yield and AI Model Accuracy With High-Quality Data

High-performance vision AI models require pure input data. In semiconductor manufacturing, however, inconsistent data labeling is a major challenge limiting AI model performance. A centralized system improves data quality, reduces costly human error, and streamlines data labeling and curation. The result: manufacturers boost yield, which in turn increases customer trust and satisfaction.
The Impact of High-Quality Data: Garbage In, Garbage Out
Pure results demand pure data, especially in AI-powered quality control. If data labeling is only 85% accurate, achieving 99.95% accuracy in results is impossible. Poor-quality labeling produces too many false positives, where good chips are rejected or minor defects are flagged as serious, reducing yield and creating unnecessary waste. It also produces false negatives, where real defects go undetected and faulty products ship, costing customers and damaging brand reputation.
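As a back-of-the-envelope sketch of why that ceiling exists (all numbers below are assumptions for illustration, not Robovision figures):

```python
# Sketch: noisy labels cap the accuracy you can even measure,
# let alone achieve. All numbers are assumed for illustration.

label_accuracy = 0.85       # fraction of labels that are correct
required_accuracy = 0.9995  # accuracy the application demands

# A model is scored against the labels, so even a perfect model can agree
# with at most the share of labels that are themselves correct.
measurable_ceiling = label_accuracy

print(f"Best measurable accuracy: ~{measurable_ceiling:.1%}")
print(f"Meets the {required_accuracy:.2%} requirement: {measurable_ceiling >= required_accuracy}")
```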
Although some applications perform adequately with "impure" data, data purity ekes out that extra 1% of performance to reduce waste and improve yield. Several obstacles, however, prevent manufacturers from achieving it.
Human Error
The most common cause of impure data is human error. People make mistakes, have limited capacity for labeling tasks, and can only label effectively when given context. For example, inspection equipment vendors who implement AI-based defect classification in their machines tend to obtain images from their customers. That data often contains labeling mistakes made by whoever did the labeling. If the labelers do not understand how the labels will be used and why they matter, the outcome will be sub-par.
Even if everyone is perfectly aligned on annotation, human error is inevitable. Whether mistakes are made once, twice, or hundreds of times across huge datasets, the impact on performance is non-negligible. Imagine data labeled at 98% accuracy is used to train an AI model for defect detection: even a 2% misclassification rate can cause critical defects to be ignored, leading to field failures.
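To make that concrete, here is a minimal sketch with assumed volumes; every number is hypothetical:

```python
# Hypothetical volumes to make the 2% figure concrete; all numbers
# here are assumptions for illustration.

dataset_size = 500_000   # assumed training-set size (images)
defect_fraction = 0.01   # assume 1% of images show a critical defect
label_error_rate = 0.02  # the 2% misclassification rate from above

defect_images = dataset_size * defect_fraction
# With uniformly distributed errors, ~2% of the defect images carry a
# "good" label, teaching the model that those defect patterns are fine.
mislabeled_defects = defect_images * label_error_rate

print(f"Defect images in the set:   {defect_images:,.0f}")    # 5,000
print(f"Mislabeled as defect-free: ~{mislabeled_defects:,.0f}")  # ~100
```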
Communication Issues
Another major cause of impure data is poor communication between the domain expert and the model creator. Last year, Robovision worked with a business where the fab operators producing chips also labeled the data. Since only they had the expertise necessary for labeling, the data scientists depended on them. However, the operators' defect definitions and classification methods did not align with the data scientists'. With no way to streamline or centralize communication, data management became chaotic. The result: confusion, delays, and concerns over the usability of AI in semiconductor manufacturing. Communication breakdowns like this are common among siloed teams: the domain expert tries explaining to the equipment builder what a defect looks like, and everything is lost in translation.
Decentralized Data and Labels
Data scientists and operators often label data independently, using different naming conventions and labeling strategies. This mismatch creates purity issues that impair model performance when the data is combined or reused.
Robovision also sees end customers struggling with decentralized data due to a lack of systems integration. Many have fabs with hundreds of machines from multiple brands. Some run a mix of decades-old, irreplaceable legacy machines and newly adopted machines under a year old. Legacy systems store data in siloed formats, while newer machines use more advanced, standardized outputs. Data remains scattered, making accuracy hard to ensure. Tracking changes in labeling criteria is difficult, and inconsistent updates lead to mislabeled data and underperforming AI models. These challenges persist regardless of the number, age, or brand of machines: without centralization, creating consistent, high-quality datasets becomes difficult, diminishing AI’s effectiveness in quality control.
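One centralization step that addresses both mismatched naming conventions and siloed machine formats is a single canonical label taxonomy with per-source alias tables. The sketch below is illustrative only; the sources, labels, and aliases are invented, not Robovision’s schema:

```python
# Hypothetical alias tables mapping per-source label vocabularies onto one
# canonical set, maintained centrally so every change is tracked.

CANONICAL = {"scratch", "particle", "pattern_defect", "no_defect"}

ALIASES = {
    "operator_team": {"scratch mark": "scratch", "dust": "particle", "good": "no_defect"},
    "legacy_tool_a": {"SCR": "scratch", "PRT": "particle", "PAT": "pattern_defect", "OK": "no_defect"},
}

def normalize(source: str, raw_label: str) -> str:
    """Translate a source-specific label into the canonical vocabulary."""
    label = ALIASES.get(source, {}).get(raw_label, raw_label)
    if label not in CANONICAL:
        raise ValueError(f"Unmapped label {raw_label!r} from source {source!r}")
    return label

print(normalize("legacy_tool_a", "SCR"))   # -> scratch
print(normalize("operator_team", "good"))  # -> no_defect
```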
Robovision’s Solution for Improving Data Accuracy
Whatever the root cause of impure data, a review flow or a retroactive purity check is essential to keep that data out of training. The value of these checks grows as datasets exceed 100K records. By performing a purity check, operators can improve their model in a few clicks and, ultimately, improve quality and yield.
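One common way to implement such a retroactive check, sketched here with scikit-learn (an assumed tool choice, not a description of any product’s internals): train with cross-validation and flag samples whose stored labels the model confidently contradicts, so annotators re-review only those.

```python
# A sketch of a retroactive purity check using cross-validated predictions;
# the classifier and the 0.9 threshold are illustrative choices.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def suspect_label_indices(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Indices of samples whose held-out prediction confidently disagrees
    with the stored label (y is assumed encoded as integers 0..k-1)."""
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")
    predicted = proba.argmax(axis=1)
    confident = proba.max(axis=1) > 0.9  # flag only confident disagreements
    return np.where((predicted != y) & confident)[0]

# Usage: route the flagged samples back to annotators for targeted
# re-review instead of relabeling the entire dataset.
```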
Cue the Purity Loop, Robovision’s closed-loop feedback system. Built to tackle complex and critical tasks, it improves training-data accuracy and keeps the AI on its best behavior. Since the Robovision 5.8 AI platform deploys anywhere in a production environment, the Purity Loop continuously curates data at every stage to optimize model performance. Data selection is driven by tagging and filtering, so users can organize their data with metadata and inspect it easily. To evaluate training data, a graphical analysis of class distribution verifies class balance.
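A toy version of that class-balance check might look like this; the counting is generic, while the label names and counts are invented:

```python
# A minimal class-distribution check; label names and counts are invented.

from collections import Counter

labels = ["no_defect"] * 9400 + ["scratch"] * 450 + ["particle"] * 150

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls:14s} {n:6d}  {n / total:6.1%}")

# A skew like this one (94% a single class) signals the need for
# re-balancing or targeted data collection before training.
```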
The Purity Loop addresses both the subjective and objective dimensions of annotated-dataset quality. As production scales, the system makes large data volumes easier to handle, maintaining consistent labeling across multiple teams, sites, or production lines without compromising dataset quality.
Secure Customer Trust With Accurate Data Labeling
Accurate data labeling is crucial for developing high-performing AI models. While poor-quality data undermines an algorithm’s effectiveness, high-quality data labeling is a catalyst that helps manufacturers overcome the industry’s most problematic hurdles. With the right training data, AI models can identify and eliminate defects that would otherwise cause catastrophic errors, ensuring the highest product quality and increasing customer trust and satisfaction.