Least squares method

There are different ways to find the slope and y-intercept of a line, but the most commonly used method is known as the least squares method. The principle behind this method is simple: you have to find the line that minimizes the sum of squared errors.

In Figure 6.1, you can see a Cartesian plane with multiple points and lines drawn on it. Line a represents the best fit for this data – in other words, it is the best linear regression function for those points. But how can you know that? It is simple: if you compute the error associated with each point, you will find that Line a has the smallest sum of squared errors.

Figure 6.1 – Visualizing the principle of the least squares method
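
To make this idea concrete, here is a minimal Python sketch (not part of the book's support spreadsheet; the points and candidate lines are made-up examples) that compares the sum of squared errors of a few candidate lines. The line with the smallest value is the best fit in the least squares sense:

```python
# A minimal sketch of the least squares principle: among candidate lines,
# the best fit is the one with the smallest sum of squared errors (SSE).
# The data points and candidate lines below are illustrative only.

points = [(1, 1.2), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]

def sum_squared_errors(slope, intercept, data):
    """Sum of squared vertical distances between each point and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in data)

candidates = {
    "Line a": (1.0, 0.0),   # close to the true trend
    "Line b": (0.5, 1.0),   # too flat
    "Line c": (1.5, -1.0),  # too steep
}

for name, (slope, intercept) in candidates.items():
    print(name, round(sum_squared_errors(slope, intercept, points), 3))
```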

It is worth understanding linear regression from scratch not only for the certification exam but also for your career as a data scientist. To provide you with a complete example, a spreadsheet containing all the calculations that you are going to see has been developed! You are encouraged to jump on this support material and perform some simulations. In any case, you will see these calculations in action in the next subsection.

Creating a linear regression model from scratch

You are going to use a very simple dataset, with only two variables:

  • x: Represents the person’s number of years of work experience
  • y: Represents the person’s average salary

You want to understand the relationship between x and y and, if possible, predict the salary (y) based on years of experience (x). Real problems very often have far more independent variables and are not necessarily linear. However, this example will give you the baseline knowledge to master more complex algorithms.

To find out what the alpha and beta coefficients are (or the slope and y-intercept, if you prefer), you need to compute some statistics related to the dataset. In Table 6.3, you have the data and these auxiliary statistics.

X (INDEPENDENT) | Y (DEPENDENT) | (X - X MEAN) * (Y - Y MEAN) | (X - X MEAN)^2 | (Y - Y MEAN)^2
1 | 1,000 | 21,015 | 20.25 | 21,808,900
2 | 1,500 | 14,595 | 12.25 | 17,388,900
3 | 3,700 | 4,925 | 6.25 | 3,880,900
4 | 5,000 | 1,005 | 2.25 | 448,900
5 | 4,000 | 835 | 0.25 | 2,788,900
6 | 6,500 | 415 | 0.25 | 688,900
7 | 7,000 | 1,995 | 2.25 | 1,768,900
8 | 9,000 | 8,325 | 6.25 | 11,088,900
9 | 9,000 | 11,655 | 12.25 | 11,088,900
10 | 10,000 | 19,485 | 20.25 | 18,748,900
Summary: COUNT = 10; X MEAN = 5.50; Y MEAN = 5,670.00; COVARIANCE (X,Y) = 8,425.00; X VARIANCE = 8.25; Y VARIANCE = 8,970,100.00

Table 6.3 – Dataset to predict average salary based on the amount of work experience

As you can see, there is an almost perfect linear relationship between x and y. As the amount of work experience increases, so does the salary. In addition to x and y, you need to compute the following statistics: the number of records, the mean of x, the mean of y, the covariance of x and y, the variance of x, and the variance of y. Figure 6.2 depicts the mathematical formulas for variance and covariance, respectively, where x bar, y bar, and n represent the mean of x, the mean of y, and the number of records:

Figure 6.2 – Mathematical representation of variance and covariance respectively
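
The figure itself is not reproduced in this text, but the formulas it refers to are the standard population forms of variance and covariance (dividing by n), which is consistent with the values shown in Table 6.3:

```latex
\mathrm{var}(x) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
\qquad\qquad
\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)
```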

If you want to check the calculation details behind each of those auxiliary statistics in Table 6.3, please refer to the support material provided along with this book. There, you will find these formulas already implemented for you.

These statistics are important because they will be used to compute the alpha and beta coefficients. Figure 6.3 explains how you can compute both coefficients, along with the correlation coefficient R and the coefficient of determination R squared. These last two metrics will give you an idea of the quality of the model: the closer they are to 1, the better the model is.

Figure 6.3 – Equations to calculate coefficients for simple linear regression
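
The figure is likewise not reproduced here. Written in terms of the statistics from Table 6.3, the standard simple linear regression equations it describes (with alpha as the slope and beta as the intercept, as in Table 6.4) are as follows; R squared is simply the square of R:

```latex
\alpha = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}
\qquad
\beta = \bar{y} - \alpha\,\bar{x}
\qquad
R = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}
```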

After applying these formulas, you will come up with the results shown in Table 6.4. It already contains all the information that you need to make predictions on top of new data. If you plug the coefficients into the original equation, y = ax + b + e, and drop the error term, you will find the regression formula to be as follows: y = 1,021.212 * x + 53.3.

Coefficient | Description | Value
Alpha | Slope (line inclination) | 1,021.212121
Beta | Intercept | 53.33
R | Correlation | 0.979364354
R^2 | Determination | 0.959154538

Table 6.4 – Finding regression coefficients
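
If you prefer code to a spreadsheet, the following minimal Python sketch (an illustrative reimplementation, not the book's support material) reproduces the coefficients in Table 6.4 from the raw data in Table 6.3:

```python
# From-scratch computation of the simple linear regression coefficients,
# using population variance/covariance (dividing by n) as in Table 6.3.

x = list(range(1, 11))
y = [1000, 1500, 3700, 5000, 4000, 6500, 7000, 9000, 9000, 10000]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

cov_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n
var_x = sum((xi - x_mean) ** 2 for xi in x) / n
var_y = sum((yi - y_mean) ** 2 for yi in y) / n

alpha = cov_xy / var_x           # slope
beta = y_mean - alpha * x_mean   # intercept
r = cov_xy / (var_x ** 0.5 * var_y ** 0.5)

print(f"alpha = {alpha:.6f}")    # ~1021.212121
print(f"beta  = {beta:.2f}")     # ~53.33
print(f"R     = {r:.6f}")        # ~0.979
print(f"R^2   = {r ** 2:.6f}")   # ~0.959
```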

From this point on, to make predictions, all you have to do is replace x with the number of years of experience. As a result, you will find y, which is the projected salary. You can see the model fit in Figure 6.4 and some model predictions in Table 6.5.

Figure 6.4 – Fitting data in the regression equation

INPUT (X) | PREDICTION | ERROR
1 | 1,075 | 75
2 | 2,096 | 596
3 | 3,117 | -583
4 | 4,138 | -862
5 | 5,159 | 1,159
6 | 6,181 | -319
7 | 7,202 | 202
8 | 8,223 | -777
9 | 9,244 | 244
10 | 10,265 | 265
11 | 11,287 |
12 | 12,308 |
13 | 13,329 |
14 | 14,350 |
15 | 15,372 |
16 | 16,393 |
17 | 17,414 |
18 | 18,435 |
19 | 19,456 |
20 | 20,478 |

Table 6.5 – Model predictions
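
As a quick illustration of how the predictions in Table 6.5 are produced, here is a minimal sketch that plugs x into the fitted equation, using the (rounded) coefficients from Table 6.4:

```python
# Prediction with the fitted equation y = alpha * x + beta (Table 6.4).
ALPHA = 1021.212121  # slope
BETA = 53.33         # intercept

def predict_salary(years_of_experience):
    """Projected salary for a given number of years of experience."""
    return ALPHA * years_of_experience + BETA

for years in (1, 10, 20):
    print(years, round(predict_salary(years)))  # ~1075, ~10265, ~20478
```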

While you are analyzing regression models, you should be able to tell whether your model is of good quality or not. You read about many modeling issues (such as overfitting) in Chapter 1, Machine Learning Fundamentals, and you already know that you always have to check model performance.

A good approach for evaluating regression models is to perform what is called residual analysis. This is where you plot the errors of the model in a scatter plot and check whether they are randomly distributed (as expected) or not. If the errors are not randomly distributed, it means that your model has failed to capture some pattern in the data and cannot generalize well. Figure 6.5 shows a residual analysis based on the data from Table 6.5.

Figure 6.5 – Residual analysis
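
The figure is not reproduced here, but you can generate a similar residual plot with a short script. The following sketch assumes matplotlib is installed and reuses the dataset and coefficients from earlier in this section; the errors follow the same sign convention as Table 6.5 (prediction minus actual):

```python
# Residual analysis sketch: plot prediction errors against the input and
# check whether they look randomly scattered around zero.
import matplotlib.pyplot as plt

x = list(range(1, 11))
y = [1000, 1500, 3700, 5000, 4000, 6500, 7000, 9000, 9000, 10000]

ALPHA, BETA = 1021.212121, 53.33  # coefficients from Table 6.4

residuals = [(ALPHA * xi + BETA) - yi for xi, yi in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Years of experience (x)")
plt.ylabel("Residual (prediction - actual)")
plt.title("Residual analysis")
plt.show()
```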

The takeaway here is that the errors are randomly distributed. Such evidence, along with a high R squared value, can be used as an argument to support the use of this model.