Data splitting

Training and evaluating ML models are key tasks of the modeling pipeline. ML algorithms need data to find relationships among features in order to make inferences, but those inferences must be validated before the model is moved to a production environment.

The dataset used to train ML models is commonly called the training set. This training data must be able to represent the real environment where the model will be used; the model will be useless if that requirement is not met.

Coming back to the fraud example presented in Table 1.1, based on the training data, you found that e-commerce transactions with a value greater than $5,000 and processed at night are potentially fraudulent. Once the model is deployed to a production environment, it is expected to flag similar cases, as learned during the training process.

Therefore, if those cases only exist in the training set, the model will flag false positive cases in production environments. The opposite scenario is also true: if a particular type of fraud appears in production data but is not reflected in the training data, the model will miss it, producing a lot of false negatives. False positive and false negative rates are just two of many quality metrics that you can use for model validation. These metrics will be covered in much more detail later on.
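
As a quick preview, here is a minimal sketch of how those two rates could be computed with scikit-learn's confusion_matrix; the label arrays are made up purely for illustration:

from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraud, 0 = legitimate
y_true = [0, 0, 0, 1, 1, 0, 1, 0]  # actual outcomes
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]  # model predictions

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

fp_rate = fp / (fp + tn)  # legitimate transactions wrongly flagged
fn_rate = fn / (fn + tp)  # fraudulent transactions the model missed
print(f"False positive rate: {fp_rate:.2f}, false negative rate: {fn_rate:.2f}")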

By this point, you should have a clear understanding of the importance of having a good training set. Now, supposing you do have a valid training set, how can you gain some level of confidence that the model will perform well in production environments? The answer is by using testing and validation sets:

Figure 1.4 – Data splitting

Figure 1.4 shows the different types of data splitting that you can have during training and inference pipelines. The training data is used to create the model; the testing data is used to extract the final model quality metrics. The testing data cannot be used during the training process for any purpose; its only role is to measure the quality of the final model.

The reason to avoid using the testing data during training is simple: you cannot let the model learn from the very data that will be used to validate it. This technique of holding out one portion of the data for testing is often called hold-out validation.
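
A minimal sketch of a hold-out split, assuming scikit-learn; the synthetic dataset is just a stand-in for your own data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic features and labels standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 20% of the records for testing; random_state makes the
# shuffled split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)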

The box on the right side of Figure 1.4 represents the production data. Production data usually arrives continuously, and you have to execute the inference pipeline to extract model results from it. No training, nor any other type of recalculation, is performed on production data; you just pass it through the inference pipeline as it is.

From a technical perspective, most ML libraries implement training steps with the .fit method, while inference steps are implemented by the .transform or .predict method. Again, this is just a common pattern, but be aware that you might find different naming conventions across ML libraries.
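
To make the pattern concrete, here is a sketch using scikit-learn's LogisticRegression (just one possible estimator), reusing the train/test split from the earlier snippet:

from sklearn.linear_model import LogisticRegression

# Training step: .fit learns the model parameters from the training data only
model = LogisticRegression()
model.fit(X_train, y_train)

# Inference step: .predict applies the fitted model to unseen data;
# nothing is retrained or recalculated here
predictions = model.predict(X_test)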

Still looking at Figure 1.4, there is another box, close to the training data, named Validation data. This is a subset of the training set that is often used to support the selection of the best model before moving on to the testing phase. You will learn about validation sets in much more detail, but first, you should understand why you need them.
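
One common way to create such a subset is a second hold-out split applied to the training data alone; a minimal sketch, again assuming scikit-learn and the variables from the earlier snippets:

from sklearn.model_selection import train_test_split

# Carve a validation set out of the training data; the test set stays
# untouched for the final evaluation
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
# Combined with the earlier 80/20 train/test split, this yields roughly
# 60% training, 20% validation, and 20% testing data overall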