Data Curation Methodology
To reduce the data to a scope feasible for the application of ML, it is necessary to divide the problem space into smaller subspaces. Criteria for division and classification range from the self-evident, such as the type and family of materials, to properties, product forms, and thermomechanical treatments of the material. Taxonomies already established in the industry, such as VDA 231-200 [], can serve as a helpful guideline in this process.
The division of the problem space leads to the creation of dozens, or even hundreds, of specific ML models for particular families of materials and properties, as can be seen in Figure 1 for the example of aluminum alloys. For each of these models, a dataset appropriate for ML purposes must then be prepared. This data curation process is complex: it includes the elimination of redundant and repetitive inputs, the exclusion of extreme or inconsistent values, and the handling of missing data or intervals (e.g. in chemical composition), as illustrated in the sketch below.
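As a minimal illustration of such a cleaning pass, the Python sketch below deduplicates records, drops extreme values, and resolves interval-valued composition columns. The column-naming convention (e.g. Si_min/Si_max), the 3-sigma outlier rule, and the midpoint treatment of intervals are assumptions for illustration, not the actual production pipeline.

```python
import pandas as pd

def curate(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Sketch of a curation pass: deduplicate, drop outliers, resolve intervals."""
    df = df.drop_duplicates().copy()
    # Exclude extreme values on the target property (3-sigma rule, assumed here)
    z = (df[target] - df[target].mean()) / df[target].std()
    df = df[z.abs() <= 3.0]
    # Resolve interval-valued composition columns (hypothetical '<el>_min'/'<el>_max'
    # naming) to their midpoints; assumes every '_min' column has a matching '_max'.
    for col in [c[:-4] for c in df.columns if c.endswith("_min")]:
        df[col] = (df[f"{col}_min"] + df[f"{col}_max"]) / 2.0
        df = df.drop(columns=[f"{col}_min", f"{col}_max"])
    # Discard records where the target value itself is missing
    return df.dropna(subset=[target])
```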
One particularly challenging step is the transformation of categorical values, such as heat treatment, into consistent numerical values that can be used effectively by ML models. To achieve this, a nine-digit classification system was used, in which the first digit identifies the larger group of treatments or temper type, e.g. 1 for the F state (as fabricated), 2 for the H state (strain hardened), 3 for the O state (annealed), and 4 for the T state (thermally treated), while the remaining digits are determined by temper subdivisions. For example, the T62 temper receives the code 462000000, H111 temper products receive 211100000, whilst T8E30 receives 480000130. The meaningfulness and consistency of the applied classifications are of crucial importance for the quality of the models and their results.
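As a hedged sketch of how such an encoding might work, the Python snippet below reproduces the three coded examples above. The left-aligned zero-padding rule is inferred from the T62 and H111 examples, and the override table for variant designations such as T8E30 is an assumption standing in for the full published subdivision scheme.

```python
# Temper family -> leading digit, as described in the text
TEMPER_FAMILY = {"F": "1", "H": "2", "O": "3", "T": "4"}

# Variant designations use digit positions defined by the internal
# classification; this override table is illustrative, not the full scheme.
OVERRIDES = {"T8E30": "480000130"}

def encode_temper(designation: str) -> str:
    """Map a temper designation (e.g. 'T62', 'H111') to its nine-digit code."""
    if designation in OVERRIDES:
        return OVERRIDES[designation]
    family = TEMPER_FAMILY[designation[0]]        # e.g. 'T' -> '4'
    subdivisions = designation[1:]                # e.g. '62'
    return (family + subdivisions).ljust(9, "0")  # '462' -> '462000000'

assert encode_temper("T62") == "462000000"
assert encode_temper("H111") == "211100000"
assert encode_temper("T8E30") == "480000130"
```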
The representativeness of the datasets used for training and testing is equally important for the performance of the ML models. Even though the dataset for each ML model is considerably more manageable in size than the initial database as a whole, obtaining uniform coverage of an application domain of several thousand points in n-dimensional space is still a non-trivial problem. A combination of random splitting and the Kennard-Stone algorithm has been applied to achieve this, Figure 2.
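For reference, a minimal NumPy sketch of Kennard-Stone selection is given below: it seeds the subset with the two mutually most distant points, then repeatedly adds the candidate farthest from the points already selected. The function name and the O(n²) precomputed distance matrix are implementation choices for this sketch, not the production code.

```python
import numpy as np

def kennard_stone_indices(X, n_select):
    """Select n_select representative rows of X via the Kennard-Stone algorithm."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances (O(n^2) memory; fine for a sketch)
    d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Seed with the two mutually most distant points
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    # Minimum distance from each candidate to the current selection
    min_d = np.minimum(d[i], d[j])
    while len(selected) < n_select:
        min_d[selected] = -1.0           # exclude already-selected points
        k = int(np.argmax(min_d))        # farthest-from-set candidate
        selected.append(k)
        min_d = np.minimum(min_d, d[k])
    return selected
```

In practice the combination mentioned above could, for instance, use Kennard-Stone to pick a uniformly covering training subset and assign the remainder at random between validation and test splits.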