A borehole is incorrectly drilled. A component is placed incorrectly in the installation space. The material supply is interrupted. All these are typical scenarios that have a negative impact on production and, in the worst case, can paralyse it. Outlier detection – an efficient method for detecting and correcting errors on the basis of machine data – offers help here. Or, better still, it helps avoid mistakes in the first place.
The era of Industry 4.0 gives us more machine data than ever before. So why are there still errors in production? Dr. David Breyel, data scientist at connyun, says: "Collecting data alone does not add value. Only rarely does a measured value provide direct information – for example, whether a spare part is needed." It is therefore important to gain meaningful insights from the masses of data and to derive recommendations for action. That is exactly what the data science experts at connyun are doing. Outlier detection plays a key role in this.
"Outliers are simply data points that do not meet expectations," explains the data specialist. "In the case of one- or two-dimensional data sets, humans are often still able to recognise outliers. However, today, machines deliver large, high-dimensional data sets – nothing works without computer support." But how does outlier detection work?
Step 1: Identifying outliers
Many methods – such as one-class support vector machines or nearest-neighbour distance – are based on human thinking. Breyel puts it this way: "Measure the distances between the data points and mark those that are far from all the others." If the distances between data points cannot be measured meaningfully, the data scientist and his colleagues at connyun resort to methods that are not distance-based, such as isolation forests. "We typically apply different methods to a data set and then continue with the most successful one, because no two data sets are the same, and often even the smallest details cause significant differences."
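The distance-based idea Breyel describes can be sketched in a few lines of plain Python: flag the points whose nearest neighbour is unusually far away. This is a minimal illustration, not connyun's implementation; the threshold rule (three times the median nearest-neighbour distance) is an assumption chosen for this example.

```python
import math

def nearest_neighbour_distances(points):
    """For each point, compute the distance to its closest other point."""
    dists = []
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if i != j)
        dists.append(nearest)
    return dists

def flag_outliers(points, factor=3.0):
    """Flag points whose nearest-neighbour distance is far above the median.

    The factor-times-median threshold is an illustrative choice, not a
    universal rule."""
    dists = nearest_neighbour_distances(points)
    median = sorted(dists)[len(dists) // 2]
    threshold = factor * median
    return [i for i, d in enumerate(dists) if d > threshold]

# A tight cluster around (0, 0) plus one far-away point:
data = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (0.0, 0.0), (5.0, 5.0)]
print(flag_outliers(data))  # the point at (5, 5) stands out
```

This brute-force version compares every pair of points, which is fine for small data sets; for the large, high-dimensional data sets the article mentions, spatial index structures or methods such as isolation forests scale better.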
Step 2: Classifying outliers – error, coincidence or anomaly?
Once the outliers have been identified, they are classified. "We distinguish between error, coincidence and anomaly," explains David Breyel. Errors are caused, for example, by defective measuring equipment or typing errors during input. Coincidences are correctly recorded measurements that initially appear extraordinary, but that can occur with a certain degree of probability and are therefore harmless. Anomalies are data points that occur due to previously unobserved effects, and are therefore subject to a different statistical distribution.
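The three categories can be made concrete with a toy rule set. In this hedged sketch, the plausibility range, the normal-distribution assumption, and the factor of two are all illustrative choices, not connyun's actual rules:

```python
import math

def classify(value, mean, std, plausible, observed_rate):
    """Toy classification rules, purely illustrative:
    - outside any physically plausible range      -> error (e.g. a typo)
    - deviation about as frequent as chance says  -> coincidence
    - deviation clearly more frequent than chance -> anomaly
    """
    low, high = plausible
    if not (low <= value <= high):
        return "error"
    z = abs(value - mean) / std
    chance_rate = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided tail P(Z >= z)
    if observed_rate > 2 * chance_rate:              # factor 2 chosen arbitrarily
        return "anomaly"
    return "coincidence"

# A 10 where a length around 100 mm is expected cannot be a real measurement:
print(classify(10.0, mean=100.0, std=1.0, plausible=(50, 150), observed_rate=0.01))
```

The key distinction the sketch captures: an error is impossible given the measurement setup, a coincidence occurs no more often than its tail probability predicts, and an anomaly occurs too often to be chance – pointing at an unobserved effect.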
Step 3: Using outliers – for data cleansing or troubleshooting
If an outlier is identified as an error or a coincidence, it is used for data cleansing. The effect: further data analysis is not unintentionally skewed or even falsified. When initial results have been generated or new data is added to the project, the outlier detection should be performed again and the data set updated accordingly.
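In code, such a cleansing step might look like the following minimal sketch. The function name and the correction rule are hypothetical, chosen only to illustrate the two options: repair an erroneous value when the true value can be reconstructed, otherwise drop it.

```python
def cleanse(values, flagged, corrections=None):
    """Remove flagged outliers, or substitute a correction when the true
    value is known (illustrative sketch, not a production routine)."""
    corrections = corrections or {}
    cleaned = []
    for i, value in enumerate(values):
        if i in flagged:
            if i in corrections:
                cleaned.append(corrections[i])  # e.g. a repaired unit mix-up
            # otherwise the erroneous value is dropped entirely
        else:
            cleaned.append(value)
    return cleaned

# Index 1: a correctable typo; index 3: an unexplained reading to drop.
lengths = [100.1, 10.0, 99.9, 250.0]
print(cleanse(lengths, flagged={1, 3}, corrections={1: 100.0}))
```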
"If it is an anomaly, the outlier is of greater interest than the cleansed data itself," emphasises David Breyel. "For example, we use anomalies to identify faulty workpieces without reassessment. Or we discover effects that result from retooling a machine for another product variant." Precisely matched to the respective anomaly, further operations are performed on the data in order to automatically detect and avoid these cases in the future.
Based on the outliers, connyun's data scientists also derive clear recommendations for action and thus create added value for the customer. For example, a message is sent to the machine operator or maintenance engineer to check a workpiece for errors or to request a replacement part.
Outlier detection in practice – an example
A company produces a complex product from a range of different sub-assemblies. One of these assemblies consists of component A with a length of 100 mm and component B with a length of 150 mm. Suddenly, assemblies with distinct length deviations appear in production: they are too short. If the cause is not discovered and remedied, there will be bottlenecks in production. In such cases, connyun's data science experts would start searching for outliers.
The data analysis begins at goods receipt, with the manual quality control: company employees measure a random selection of the components by hand and enter the results in an Excel spreadsheet. The outlier is quickly found: for component A, normally 100 mm long, values of 10 appear in some places. This outlier is just as quickly classified as an error: an employee entered the length in centimetres instead of millimetres. The data is corrected and re-analysed. The result: component A shows no appreciable deviation.
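The centimetre/millimetre mix-up is a classic unit error that can be repaired automatically once recognised. A hedged sketch of one possible correction rule – the factor-of-ten heuristic and the tolerance are assumptions made for this example, not a general recipe:

```python
def fix_unit_errors(lengths_mm, nominal_mm=100.0, tolerance_mm=5.0):
    """Repair entries that are plausibly a cm value typed into a mm field:
    a value that lands near the nominal length after multiplying by 10 is
    assumed to be such a typo (illustrative rule only)."""
    fixed = []
    for value in lengths_mm:
        if abs(value * 10 - nominal_mm) <= tolerance_mm:
            value *= 10  # e.g. 10 cm entered instead of 100 mm
        fixed.append(value)
    return fixed

# Two correct readings, one exact cm typo, one slightly-off cm typo:
print(fix_unit_errors([100.2, 10.0, 99.8, 9.5]))
```

A rule like this should only run after a human has confirmed the diagnosis, since a genuine 10 mm part would otherwise be silently rewritten.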
The deviation must therefore lie in component B. And indeed, the incoming goods table shows that the supplier produces items with an average length of 150 mm and a standard deviation of 1 mm – but some of the parts were measured at around 147.5 mm. Has this been happening to about every 50th part since a certain point in time? If so, it could be an anomaly – perhaps because the supplier put a new machine into operation that had not yet been set to the correct size. However, the data analysis shows that only about every 200th component B has this smaller dimension – so it is a coincidence, and the problem is most probably not with the supplier.
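The 1-in-50 versus 1-in-200 reasoning can be checked against the normal distribution: at 147.5 mm, a part is 2.5 standard deviations below the 150 mm mean, and values that far down occur by chance with a probability of about 0.6 per cent. A minimal check using only the standard library:

```python
import math

def normal_cdf(x, mean, std):
    """P(X <= x) for X ~ N(mean, std^2), via the complementary error function."""
    return 0.5 * math.erfc((mean - x) / (std * math.sqrt(2)))

# Probability that a part measures 147.5 mm or less purely by chance:
p = normal_cdf(147.5, mean=150.0, std=1.0)
print(f"chance rate: {p:.4f}")  # about 0.0062, i.e. roughly 1 in 160
# 1 in 200 (0.005) is below the chance rate -> plausibly a coincidence;
# 1 in 50 (0.02) is over three times the chance rate -> would point to an anomaly.
```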
The length deviation cannot be explained with the data from goods receipt. Consequently, it must be a deviation that occurs during production. Subsequent data analysis confirms this: it shows a clear outlier, in this case an anomaly – most of the faulty assemblies were made in the same shift. It turns out that the employees in this shift inadvertently mixed prototypes of component B into the input material. These prototypes are shorter than the original components. The company takes immediate action to avoid such errors in the future.
Outlier detection creates added value – a conclusion
In summary, outlier detection is not just a necessary evil for cleansing the data, but an essential part of data analysis and thus a prerequisite for machine learning. It often creates added value even on its own.