AWS provides a built-in algorithm known as linear learner, which you can use to implement linear regression models. The linear learner built-in algorithm uses Stochastic Gradient Descent (SGD) to train the model.
You will learn more about SGD when neural networks are discussed. For now, you can think of SGD as an alternative to the popular least squares error method that was just discussed.
The linear learner built-in algorithm provides a hyperparameter, normalize_data, that applies normalization to the data prior to the training process. This is very helpful since linear models are sensitive to the scale of the data and usually benefit from data normalization.
Data normalization was discussed in Chapter 4, Data Preparation and Transformation. Please review that chapter if you need to.
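As a rough illustration of where this hyperparameter fits, here is a minimal sketch using the SageMaker Python SDK (version 2); the IAM role, S3 bucket, and instance settings are placeholders that you would replace with your own resources:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

# Retrieve the container image of the linear learner built-in algorithm
container = image_uris.retrieve("linear-learner", region)

linear = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/linear-learner/output",  # hypothetical bucket
    sagemaker_session=session,
)

# normalize_data tells the algorithm to normalize the features before training;
# predictor_type selects the type of model (a regressor, in this section)
linear.set_hyperparameters(
    predictor_type="regressor",
    normalize_data="true",
)
```

From here, calling linear.fit() with your training data channels would start the training job.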
Some other important hyperparameters of the linear learner algorithm are l1 and wd, which play the roles of L1 regularization and L2 regularization (weight decay), respectively.
L1 and L2 regularization help the linear learner (or any other regression algorithm implementation) to avoid overfitting. Conventionally, regression models that implement L1 regularization are called lasso regression models, while regression models with L2 regularization are called ridge regression models.
Although it might sound complex, it is not! The regression model equation is still the same: y = ax + b + e. The change is in the loss function, which is used to find the coefficients that minimize the error. If you look back at Figure 6.1, you will see that the error function is defined as e = (ŷ – y)^2, where ŷ is the value predicted by the regression function and y is the real value.
L1 and L2 regularization add a penalty term to the loss function, as shown in the formulas in Figure 6.6 (note that you are replacing ŷ with ax + b):
Figure 6.6 – L1 and L2 regularization
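In case you do not have the figure at hand, the penalized loss functions take roughly the following form (written here in LaTeX, using the same y = ax + b notation from this section, with n data points and λ controlling the strength of the penalty):

```latex
% L1 (lasso): squared error plus the absolute value of the slope coefficient
L_{L1} = \sum_{i=1}^{n} \big( y_i - (a x_i + b) \big)^2 + \lambda |a|

% L2 (ridge): squared error plus the squared slope coefficient
L_{L2} = \sum_{i=1}^{n} \big( y_i - (a x_i + b) \big)^2 + \lambda a^2
```

With more than one input feature, the penalty is applied to every coefficient: the sum of the absolute values for L1, and the sum of the squares for L2.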
The λ (lambda) parameter must be greater than 0 and manually tuned. A very high lambda value may result in underfitting, while a very low lambda may not produce noticeable changes in the end results (if your model is overfitting, it will stay overfitted).
In practical terms, the main difference between L1 and L2 regularization is that L1 will shrink the less important coefficients to 0, which forces the associated features to be dropped (acting as a feature selector). In other words, if your model is overfitting because of a high number of features, L1 regularization should help you solve this problem.
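To make this feature selector behavior concrete, here is a small, AWS-independent sketch using scikit-learn's Lasso (L1) and Ridge (L2) models on synthetic data where only two out of five features actually matter; the exact numbers will vary, but the lasso model typically drives the irrelevant coefficients to exactly zero, while the ridge model only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 5 features, but only the first 2 influence the target
rng = np.random.RandomState(42)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha plays the role of the lambda parameter discussed above
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 regularization
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 regularization

print("Lasso coefficients:", lasso.coef_)  # irrelevant features end up at 0.0
print("Ridge coefficients:", ridge.coef_)  # irrelevant features are small, not 0
```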
During your exam, remember the basics of L1 and L2 regularization, especially the key difference between them: L1 works well as a feature selector.
Finally, many built-in algorithms can serve multiple modeling purposes. The linear learner algorithm can be used for regression, binary classification, and multiclass classification. Make sure you remember this during your exam (it is not just about regression models).
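Assuming the linear estimator from the earlier sketch, switching between these three modes is just a matter of changing the predictor_type hyperparameter (num_classes is only required in the multiclass case):

```python
# Regression (the case discussed in this section)
linear.set_hyperparameters(predictor_type="regressor")

# Binary classification
linear.set_hyperparameters(predictor_type="binary_classifier")

# Multiclass classification; num_classes tells the algorithm how many labels exist
linear.set_hyperparameters(predictor_type="multiclass_classifier", num_classes=3)
```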
AWS has other built-in algorithms that work for both regression and classification problems, namely factorization machines, K-Nearest Neighbors (KNN), and XGBoost. Since these algorithms can also be used for classification purposes, they will be covered in the section about classification algorithms.
You’ve just been given a very important tip to remember during the exam: linear learner, factorization machines, KNN, and XGBoost are suitable for both regression and classification problems. These algorithms are often referred to as general-purpose algorithms.
With that, you have reached the end of this section about regression models. Remember to check out the supporting material before you take the exam. You can also use the reference material when you are working on your daily activities! Now, let us move on to another classical example of a machine learning problem: classification models.