
# Data Curation Methodology

To arrive at a scope of data for which ML is feasible, the problem space must be divided into smaller subspaces. Criteria for division and classification range from the self-evident, such as the type and family of materials and the property of interest, to product forms and thermomechanical treatments. Diversified taxonomies already applied in the industry, such as VDA 231-200 \[6[^1]], can be a helpful guideline in this process.

The division of the problem space leads to the creation of dozens or even hundreds of specific ML models for particular families of materials and properties, as shown in Figure 1 for the example of aluminum alloys. For each of them, a dataset appropriate for ML purposes must then be prepared. This data curation process is complex and includes eliminating redundant and repeated inputs, excluding extreme or inconsistent values, and handling missing data or intervals (e.g. in chemical composition).

<figure><img src="/files/yUPmWcx1oJRjc2n71bq3" alt=""><figcaption><p>Fig 1. An example of the classification of ML models for aluminum alloys.</p></figcaption></figure>
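
The curation steps listed above can be illustrated with a minimal sketch. The source does not describe the actual rules used, so everything here is an assumption made for illustration: the record layout, the field names, the use of interval midpoints for composition ranges, and the median absolute deviation (MAD) test with threshold `k` for extreme values.

```python
from statistics import median


def midpoint(value):
    """Collapse a (min, max) interval to its midpoint; pass scalars through."""
    if isinstance(value, tuple):
        low, high = value
        return (low + high) / 2
    return value


def curate(records, target, k=5.0):
    """Illustrative curation pass over a list of dict records (hypothetical layout)."""
    # 1. Eliminate exact duplicates while preserving the original order.
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)

    # 2. Drop records with a missing target and replace intervals by midpoints
    #    (assumption: a midpoint is an acceptable stand-in for a range).
    cleaned = [
        {name: midpoint(value) for name, value in rec.items()}
        for rec in unique
        if rec.get(target) is not None
    ]
    if not cleaned:
        return cleaned

    # 3. Exclude extreme target values with a MAD rule (assumed, not from the source).
    values = [rec[target] for rec in cleaned]
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return cleaned
    return [rec for rec in cleaned if abs(rec[target] - center) <= k * mad]
```

A rule based on the median is used here only because it stays stable on small samples; any other consistency check could take its place.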

One particularly challenging part is the transformation of categorical values, such as heat treatment, into consistent numerical values that ML models can use effectively. To achieve this, a nine-digit classification system was used, in which the first digit identifies the broad group of treatments or temper type, e.g. 1 for the F state (As Fabricated), 2 for the H state (Strain Hardened), 3 for the O state (Annealed) and 4 for the T state (Thermally Treated), and the remaining digits are determined by the subdivisions of the temper. For example, the T62 temper gets code 462000000, H111 temper products get 211100000, while T8E30 gets 480000130. The meaningfulness and consistency of the applied classifications are crucial to the quality of the models and their results.
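
The simplest case of this encoding can be sketched as follows. This is a partial illustration, not the actual implementation: it covers only tempers made of a group letter followed by digits, which reproduces the T62 and H111 examples above. The rule that maps suffixes such as the `E30` in T8E30 to trailing digits is not described in the text, so the hypothetical `encode_temper` function rejects those designations rather than guessing.

```python
import re

# First digit for each temper group, as listed in the text.
GROUP_DIGIT = {"F": "1", "H": "2", "O": "3", "T": "4"}
CODE_LENGTH = 9


def encode_temper(temper: str) -> int:
    """Encode a plain temper designation (letter plus digits) as a nine-digit code."""
    match = re.fullmatch(r"([FHOT])([0-9]*)", temper.strip().upper())
    if match is None:
        # Suffixed tempers such as T8E30 need rules the source does not spell out.
        raise ValueError(f"unsupported temper designation: {temper!r}")
    group, digits = match.groups()
    code = GROUP_DIGIT[group] + digits
    if len(code) > CODE_LENGTH:
        raise ValueError(f"temper designation too long: {temper!r}")
    # Right-pad with zeros so every code has the same number of digits.
    return int(code.ljust(CODE_LENGTH, "0"))


print(encode_temper("T62"))   # 462000000
print(encode_temper("H111"))  # 211100000
```

Padding every code to the same length keeps related tempers numerically close: all T6x tempers fall between 460000000 and 469999999.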

The representativeness of the training and testing datasets is also important for the performance of the ML models. Even though the dataset for each ML model is considerably more manageable in size than the whole initial database, obtaining uniform coverage of an application domain of several thousand points in *n*-dimensional space is still not a trivial problem. A combination of random splitting and the Kennard-Stone algorithm was applied to achieve this (Figure 2).

<figure><img src="/files/hrP7nKNrsvMlZo7OICHx" alt=""><figcaption><p>Fig 2. Preparation of datasets for training and testing of ML models.</p></figcaption></figure>
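
A minimal sketch of how Kennard-Stone selection and random splitting could be combined is given below. The text does not state how the two are combined, so the scheme here is an assumption: Kennard-Stone picks a core of `n_core` points that spans the domain and always goes into the training set, and the remaining points are split at random. The function names, the parameters and the use of Euclidean distance are likewise illustrative.

```python
import math
import random


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Seed with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [min(dist[i][first], dist[i][second]) for i in range(n)]

    # Repeatedly add the point that is farthest from everything selected so far.
    while len(selected) < n_select:
        chosen = set(selected)
        candidate = max(
            (i for i in range(n) if i not in chosen),
            key=min_dist.__getitem__,
        )
        selected.append(candidate)
        min_dist = [min(min_dist[i], dist[i][candidate]) for i in range(n)]
    return selected


def split_dataset(points, n_core, test_fraction, seed=0):
    """Assumed scheme: Kennard-Stone core goes to training, the rest is split randomly."""
    core = kennard_stone(points, n_core)
    core_set = set(core)
    rest = [i for i in range(len(points)) if i not in core_set]
    random.Random(seed).shuffle(rest)
    n_test = min(round(test_fraction * len(points)), len(rest))
    test = sorted(rest[:n_test])
    train = sorted(core + rest[n_test:])
    return train, test
```

The full pairwise distance matrix costs quadratic memory, which is workable for the several thousand points mentioned above but would need a different approach for much larger sets.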

[^1]: VDA 231-200: Werkstoffdatensatz - Spezifikation von Werkstoffen und Oberflächen in IT-Systemen / Material record, VDA, 2016
