There are different ways to find the slope and y-intercept of a line, but the most commonly used approach is known as the least squares method. The principle behind this method is simple: you have to find the line that minimizes the sum of squared errors.
In Figure 6.1, you can see a Cartesian plane with multiple points and several candidate lines. Line a represents the best fit for this data – in other words, it would be the best linear regression function for those points. But how can you know that? It is simple: if you compute the error associated with each point, you will find that Line a has the smallest sum of squared errors.
Figure 6.1 – Visualizing the principle of the least squares method
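If it helps to see this criterion in code, here is a minimal sketch of how the sum of squared errors compares candidate lines. The points and the two candidate lines are made up purely for illustration; they are not the data from Figure 6.1:

```python
# Minimal sketch: the sum of squared errors (SSE) criterion behind least squares.
# The points and the two candidate lines below are made up purely for illustration.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]

def sse(slope, intercept, data):
    """Sum of squared vertical distances between each point and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in data)

print(sse(1.9, 0.2, points))   # a good candidate line -> small SSE (about 0.1)
print(sse(1.0, 2.0, points))   # a poor candidate line -> large SSE (about 5.5)
```

The line with the smallest SSE is the one the least squares method selects.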
It is worth understanding linear regression from scratch not only for the certification exam but also for your career as a data scientist. To provide you with a complete example, a spreadsheet containing all the calculations that you are going to see has been developed as support material. You are encouraged to open it and perform some simulations. In any case, you will see these calculations in action in the next subsection.
You are going to use a very simple dataset with only two variables: x, the number of years of work experience, and y, the salary.
You want to understand the relationship between x and y and, if possible, predict the salary (y) based on years of experience (x). Real problems very often have far more independent variables and are not necessarily linear. However, this example will give you the baseline knowledge to master more complex algorithms.
To find out what the alpha and beta coefficients are (or slope and y-intercept, if you prefer), you need to compute some auxiliary statistics from the dataset. In Table 6.3, you have the data and these auxiliary statistics.
X (INDEPENDENT) | Y (DEPENDENT) | (x − x̄)(y − ȳ) | (x − x̄)² | (y − ȳ)² |
1 | 1,000 | 21,015 | 20.25 | 21,808,900 |
2 | 1,500 | 14,595 | 12.25 | 17,388,900 |
3 | 3,700 | 4,925 | 6.25 | 3,880,900 |
4 | 5,000 | 1,005 | 2.25 | 448,900 |
5 | 4,000 | 835 | 0.25 | 2,788,900 |
6 | 6,500 | 415 | 0.25 | 688,900 |
7 | 7,000 | 1,995 | 2.25 | 1,768,900 |
8 | 9,000 | 8,325 | 6.25 | 11,088,900 |
9 | 9,000 | 11,655 | 12.25 | 11,088,900 |
10 | 10,000 | 19,485 | 20.25 | 18,748,900 |
COUNT: 10 | X MEAN: 5.50 | Y MEAN: 5,670.00 | COVARIANCE (X, Y): 8,425.00 | X VARIANCE: 8.25 | Y VARIANCE: 8,970,100.00 |
Table 6.3 – Dataset to predict average salary based on the amount of work experience
As you can see, there is an almost perfect linear relationship between x and y. As the amount of work experience increases, so does the salary. In addition to x and y, you need to compute the following statistics: the number of records, the mean of x, the mean of y, the covariance of x and y, the variance of x, and the variance of y. Figure 6.2 depicts the mathematical representation of variance and covariance, where x bar, y bar, and n represent the mean of x, the mean of y, and the number of records, respectively:
Figure 6.2 – Mathematical representation of variance and covariance respectively
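For reference, the population (divide-by-n) forms of these statistics, which are the ones that reproduce the totals shown in Table 6.3, can be written as follows:

```latex
\operatorname{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
\qquad
\operatorname{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
```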
If you want to check the calculation details of the formulas for each of those auxiliary statistics in Table 6.3, please refer to the support material provided along with this book. There, you will find these formulas already implemented for you.
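If you prefer code to spreadsheets, the following minimal Python sketch performs the same calculations on the dataset from Table 6.3. It is illustrative only and is not part of the official support material:

```python
# Minimal sketch: computing the auxiliary statistics of Table 6.3 in plain Python.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                                  # years of experience
y = [1000, 1500, 3700, 5000, 4000, 6500, 7000, 9000, 9000, 10000]    # salary

n = len(x)
x_mean = sum(x) / n                                                   # 5.5
y_mean = sum(y) / n                                                   # 5670.0

# Population (divide-by-n) covariance and variances, matching the totals in Table 6.3
cov_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n   # 8425.0
var_x = sum((xi - x_mean) ** 2 for xi in x) / n                           # 8.25
var_y = sum((yi - y_mean) ** 2 for yi in y) / n                           # 8970100.0
```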
These statistics are important because they will be used to compute the alpha and beta coefficients. Figure 6.3 explains how you can compute both coefficients, along with the correlation coefficient R and the coefficient of determination R squared. These last two metrics give you an idea of the quality of the model: the closer they are to 1, the better the model is.
Figure 6.3 – Equations to calculate coefficients for simple linear regression
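Written out, the standard simple linear regression equations, which reproduce the values shown in Table 6.4, are as follows:

```latex
\alpha = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)},
\qquad
\beta = \bar{y} - \alpha\,\bar{x},
\qquad
R = \frac{\operatorname{Cov}(x, y)}{\sqrt{\operatorname{Var}(x)\,\operatorname{Var}(y)}},
\qquad
R^2 = R \times R
```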
After applying these formulas, you will come up with the results shown in Table 6.4. It contains all the information that you need to make predictions on new data. If you replace the coefficients in the original equation, y = ax + b + e, you will find the regression formula to be as follows: y = 1,021.21 * x + 53.33.
Coefficient | Description | Value |
Alpha | Slope (line inclination) | 1,021.212121 |
Beta | y-intercept | 53.33 |
R | Correlation coefficient | 0.979364354 |
R^2 | Coefficient of determination | 0.959154538 |
Table 6.4 – Finding regression coefficients
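As a quick cross-check, here is a short Python sketch that derives these coefficients from the statistics in Table 6.3. Again, this is an illustrative sketch, not the book's support spreadsheet:

```python
# Sketch: deriving the coefficients in Table 6.4 from the statistics in Table 6.3.
x_mean, y_mean = 5.5, 5670.0
cov_xy, var_x, var_y = 8425.0, 8.25, 8970100.0

alpha = cov_xy / var_x                        # slope          -> 1021.2121...
beta = y_mean - alpha * x_mean                # y-intercept    -> 53.33...
r = cov_xy / (var_x * var_y) ** 0.5           # correlation    -> 0.9794...
r_squared = r ** 2                            # determination  -> 0.9592...
print(alpha, beta, r, r_squared)
```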
From this point on, to make predictions, all you have to do is replace x with the number of years of experience. As a result, you will find y, which is the projected salary. You can see the model fit in Figure 6.4 and some model predictions in Table 6.5.
Figure 6.4 – Fitting data in the regression equation
INPUT (YEARS OF EXPERIENCE) | PREDICTION (SALARY) | ERROR (PREDICTION − ACTUAL) |
1 | 1,075 | 75 |
2 | 2,096 | 596 |
3 | 3,117 | -583 |
4 | 4,138 | -862 |
5 | 5,159 | 1,159 |
6 | 6,181 | -319 |
7 | 7,202 | 202 |
8 | 8,223 | -777 |
9 | 9,244 | 244 |
10 | 10,265 | 265 |
11 | 11,287 | |
12 | 12,308 | |
13 | 13,329 | |
14 | 14,350 | |
15 | 15,372 | |
16 | 16,393 | |
17 | 17,414 | |
18 | 18,435 | |
19 | 19,456 | |
20 | 20,478 | |
Table 6.5 – Model predictions
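The following Python sketch reproduces Table 6.5 using the coefficients from Table 6.4. It is illustrative only, and small rounding differences are expected:

```python
# Sketch: reproducing the predictions and errors in Table 6.5.
alpha, beta = 1021.212121, 53.33              # coefficients from Table 6.4
actual = {1: 1000, 2: 1500, 3: 3700, 4: 5000, 5: 4000,
          6: 6500, 7: 7000, 8: 9000, 9: 9000, 10: 10000}

for years in range(1, 21):
    prediction = alpha * years + beta         # y = 1021.21 * x + 53.33
    error = round(prediction) - actual[years] if years in actual else None
    print(years, round(prediction), error)    # input, prediction, error
```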
While analyzing regression models, you should be able to tell whether your model is of good quality or not. You read about many modeling issues (such as overfitting) in Chapter 1, Machine Learning Fundamentals, and you already know that you always have to check model performance.
A good approach for evaluating regression models is to perform what is called residual analysis: you plot the errors of the model in a scatter plot and check whether they are randomly distributed (as expected) or not. If the errors show a pattern instead of being randomly distributed, this means that your model was unable to fully capture the relationship in the data. Figure 6.5 shows a residual analysis based on the data from Table 6.5.
Figure 6.5 – Residual analysis
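A residual plot like the one in Figure 6.5 can be sketched with matplotlib, for example, using the errors from Table 6.5. This is an illustrative sketch, not the code used to generate the book's figure:

```python
# Sketch: residual analysis - plotting prediction errors against the input variable.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
residuals = [75, 596, -583, -862, 1159, -319, 202, -777, 244, 265]  # errors from Table 6.5

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")               # zero-error reference line
plt.xlabel("Years of experience")
plt.ylabel("Residual (prediction - actual)")
plt.title("Residual analysis")
plt.show()
```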
The takeaway here is that the errors are randomly distributed. Such evidence, along with a high R squared value, can be used as an argument to support the use of this model.