Now that you know what variance and data splitting are, you can go a little deeper into the training dataset requirements. You are very likely to find questions about data shuffling on the exam. This process consists of randomizing the order of the records in your training dataset before you start using it to fit an algorithm.
Data shuffling helps the algorithm reduce variance, which results in a more generalizable model. For example, let’s say your training dataset represents a binary classification problem and is sorted by the target variable (all cases belonging to class “0” appear first, followed by all cases belonging to class “1”).
When you fit an algorithm on this sorted data (especially an algorithm that relies on batch processing), it will make strong assumptions about the pattern of one of the classes, because batches taken in order from sorted data are very unlikely to contain a good representation of both classes. Once the algorithm has built strong assumptions about the training data, it might be difficult for it to change them.
Important note
Some algorithms are able to execute the training process by fitting the data in chunks, also known as batches. This approach lets the model update more frequently, since it revises its assumptions after processing each batch of data (instead of making decisions only after processing the entire dataset).
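The effect of sorting on batch composition can be illustrated with a minimal sketch. It assumes a toy target of 50 zeros followed by 50 ones and a batch size of 10; the helper names `make_batches` and `single_class_batches` are hypothetical and exist only for this illustration. It counts how many batches contain a single class before and after shuffling.

```python
import random

def make_batches(labels, batch_size):
    """Split a list of labels into consecutive batches."""
    return [labels[i:i + batch_size] for i in range(0, len(labels), batch_size)]

def single_class_batches(batches):
    """Count the batches that contain only one class."""
    return sum(1 for batch in batches if len(set(batch)) == 1)

# Toy binary target sorted by class: 50 zeros followed by 50 ones (assumed data).
sorted_labels = [0] * 50 + [1] * 50

# Shuffle a copy with a fixed seed so the run is reproducible.
rng = random.Random(42)
shuffled_labels = sorted_labels[:]
rng.shuffle(shuffled_labels)

sorted_batches = make_batches(sorted_labels, batch_size=10)
shuffled_batches = make_batches(shuffled_labels, batch_size=10)

# With sorted data, every one of the 10 batches holds a single class.
print("single-class batches (sorted):  ", single_class_batches(sorted_batches))
print("single-class batches (shuffled):", single_class_batches(shuffled_batches))
```

In a real dataset, the features and the target must be shuffled together so that each record keeps its own label; this sketch shuffles only the labels because it looks only at class balance.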
On the other hand, there is no need to shuffle the testing set, since it will be used only by the inference process to check model performance.
So far, you have learned about model building, validation, and management. You can now complete the foundations of ML by learning about a couple of other expectations while modeling.
The first one is parsimony. Parsimony describes models that offer the simplest explanation while still fitting the data as well as the alternatives. Here’s an example: while creating a linear regression model, you realize that adding 10 more features will improve your model’s performance by 0.001%. In this scenario, you should consider whether this performance improvement is worth the loss of parsimony (since your model will become more complex). Sometimes it is worth it, but most of the time it is not. You need to be skeptical and think according to your business case.
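One common way to make this trade-off visible, though not the only one, is adjusted R², which penalizes a regression model for every feature it uses. The sketch below is only an illustration: the sample size, feature counts, and R² values are made-up assumptions, and `adjusted_r2` is a hypothetical helper that applies the standard formula.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers: 200 training rows, a 5-feature model versus a
# 15-feature model whose plain R^2 is only marginally higher.
simple_model = adjusted_r2(r2=0.90000, n_samples=200, n_features=5)
complex_model = adjusted_r2(r2=0.90001, n_samples=200, n_features=15)

print(f"adjusted R^2, 5 features:  {simple_model:.4f}")
print(f"adjusted R^2, 15 features: {complex_model:.4f}")

# After the penalty for the extra features, the simpler model scores higher.
print("prefer simpler model:", simple_model > complex_model)
```

A metric like this does not replace judgment. It only quantifies how little the extra features contribute, and the final decision still depends on your business case.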
Parsimony directly supports interpretability. The simpler your model is, the easier it is to explain. However, there is a tension between interpretability and predictive power: if you focus on predictive power, you are likely to lose some interpretability. Again, you must choose the balance that best suits your use case.