Overfitting and underfitting

ML models might suffer from two types of fitting issues: overfitting and underfitting. Overfitting means that your model performs very well on the training data but cannot be generalized to other datasets, such as testing and, even worse, production data. In other words, if you have an overfitted model, it only works on your training data.

When you are building ML models, you want to create solutions that are able to generalize what they have learned and infer decisions on other datasets that follow the same data distribution. A model that only works on the data it was trained on is useless. Overfitting usually happens due to an excessive number of features or a lack of hyperparameter tuning in the algorithm.

On the other hand, underfitted models fail to fit the data even during the training phase. As a result, they are so generic that they can't perform well on the training, testing, or production data. Underfitting usually happens due to a lack of good features/observations or insufficient training time (some algorithms need more iterations to properly fit the data).
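The following is a minimal sketch, not taken from the study guide, that illustrates both issues by fitting polynomial models of different complexity with scikit-learn. The synthetic dataset, the chosen degrees, and the R² metric are illustrative assumptions:

```python
# Illustrative sketch: underfitting vs. overfitting with polynomial models.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(scale=0.2, size=60)  # noisy non-linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for degree in (1, 4, 15):  # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d} | "
          f"train R2={r2_score(y_train, model.predict(X_train)):.2f} | "
          f"test R2={r2_score(y_test, model.predict(X_test)):.2f}")

# degree=1 scores poorly on both sets (underfitting); degree=15 scores well on
# the training data but drops on the test set (overfitting).
```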

Both overfitting and underfitting need to be avoided. There are many modeling techniques to work around them. For instance, you will learn about the commonly used cross-validation technique and its relationship with the validation data box shown in Figure 1.4.

Applying cross-validation and measuring overfitting

Cross-validation is a technique in which the training data is repeatedly split into training and validation sets. The model is trained on the training portion and evaluated on the validation portion. The most common cross-validation strategy is known as k-fold cross-validation, where k is the number of splits of the training set.

Using k-fold cross-validation with k equal to 10, you split the training set into 10 folds. The model will be trained and tested 10 times: on each iteration, it uses nine folds for training and leaves one fold for testing. After 10 executions, the evaluation metrics extracted from each iteration are averaged to represent the final model performance during the training phase, as shown in Figure 1.5:

Figure 1.5 – Cross-validation in action
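The following is a minimal sketch of 10-fold cross-validation, assuming scikit-learn; the dataset and the logistic regression estimator are illustrative choices, not part of the study guide:

```python
# Illustrative sketch: 10-fold cross-validation, averaging the per-fold scores.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)  # k = 10 splits
scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")

print("Per-fold accuracy:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))  # training-phase performance estimate
```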

Another common cross-validation technique is known as leave-one-out cross-validation (LOOCV). In this approach, the model is trained many times: within each iteration, one observation is held out for testing and all the others are used for training.
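Here is a minimal sketch of LOOCV, again assuming scikit-learn; the iris dataset and the k-nearest neighbors classifier are illustrative assumptions:

```python
# Illustrative sketch: leave-one-out cross-validation (one observation held out per fold).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()  # as many folds as there are observations

scores = cross_val_score(KNeighborsClassifier(), X, y, cv=loo, scoring="accuracy")
print("Number of iterations:", len(scores))   # 150 for the iris dataset
print("LOOCV accuracy:", scores.mean().round(3))
```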

There are many advantages of using cross-validation during training:

  • You mitigate overfitting in the training data since the model is always trained on a particular chunk of data and tested on another chunk that hasn’t been used for training.
  • You avoid overfitting in the test data since there is no need to keep using the testing data to optimize the model.
  • You expose the presence of overfitting or underfitting. If the model performance in the training/validation data is very different from the performance observed in the testing data, something is wrong.

It might be worth diving into the third item on that list since it is widely covered in the AWS Machine Learning Specialty exam. For instance, assume you are creating a binary classification model, using cross-validation during training, and using a testing set to extract final metrics (hold-out validation). If you get 80% accuracy in the cross-validation results and 50% accuracy in the testing set, it means that the model was overfitted to the training set, and so cannot be generalized to the testing set.

On the other hand, if you get 50% accuracy in the training set and 80% accuracy in the testing set, there is a systemic issue in the data. It is very likely that the training and testing sets do not follow the same distribution.
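The comparison described above can be sketched as follows, assuming scikit-learn; the dataset, the random forest estimator, and the 80/20 split are hypothetical choices used only to show the mechanics:

```python
# Illustrative sketch: compare cross-validation accuracy with hold-out test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42)
cv_accuracy = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy").mean()

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"Cross-validation accuracy: {cv_accuracy:.2f}")
print(f"Hold-out test accuracy:    {test_accuracy:.2f}")

# A cross-validation score far above the test score points to overfitting;
# a test score far above the cross-validation score suggests the two sets
# do not follow the same distribution.
```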

Important note

Accuracy is a model evaluation metric commonly used for classification models. It measures how often the model makes a correct decision during its inference process. That metric was selected just for the sake of demonstration, but be aware that there are many other evaluation metrics applicable to each type of model (which will be covered at the appropriate time).