Mathematics Behind Linear Regression

Exploring statistics using Calculus

Rahul Ravi
Towards Data Science


Hi Everyone,

This article is about the mathematics used in the linear regression (with gradient descent) algorithm. It was originally part of my IB HL Mathematics Exploration.

Linear Regression: What is it?

Linear Regression is a statistical tool that produces a line of best fit for a given dataset. To produce the regression line manually, one needs to compute the mean squared error and optimise the cost function; both are explained in detail later in this article. The main problem arises when the dataset is so large that doing this by hand becomes computationally impractical. In that case, a computer can perform the task far more quickly with just a few lines of code in any language.

The linear regression algorithm uses a dataset (pairs of input and output values) to generate a line of best fit for that dataset. To start, the algorithm defines a hypothesis of the form 𝑦 = 𝑎𝑥 + 𝑏; its goal is to output the values of a and b.

The Computer Program: How does it find these constants (a and b)?

Python libraries provide their own linear regression methods to find these constants of the line of best fit. Below is a brief overview of the algorithm and how it computes them. It uses an iterative variant of least-squares regression to arrive at these values after a certain number of iterations, where each iteration repeats a given process using the previous value. Below is this modified method of least-squares regression.

It uses the mean squared error, computed over all input-output pairs: for every pair, it calculates the difference between the actual output and the output predicted by the line and squares that difference. Squaring ensures the measure reflects how close each point is to the line the algorithm predicts. Once the summation is computed, the total is divided by the number of input-output pairs, giving the mean squared error. For linear regression to produce a line of best fit that is accurate for the given dataset, it needs to minimise this mean squared error function, called the ‘Cost Function’.
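
To make this concrete, here is a minimal Python sketch of the mean squared error for a candidate line 𝑦 = 𝑎𝑥 + 𝑏 (the function and variable names are my own illustration, not taken from any particular library):

```python
def mean_squared_error(xs, ys, a, b):
    """Mean squared error of the line y = a*x + b over the dataset (xs, ys)."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = a * x + b           # value predicted by the candidate line
        total += (y - predicted) ** 2   # squared difference from the actual value
    return total / len(xs)             # divide by the number of input-output pairs

# A perfect fit gives an error of 0; a poorer fit gives a larger value
print(mean_squared_error([1, 2, 3], [2, 4, 6], a=2, b=0))  # 0.0
print(mean_squared_error([1, 2, 3], [2, 4, 6], a=1, b=0))  # 4.666...
```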

Where the MSE Function Fits In

The cost function is based on the statistical concept of the mean squared error. The figure below depicts the process by which the mean squared error is calculated.

(Fig 1: Sample Scatter Plot, Image By Author)

Mean squared error is calculated by taking the predicted value for a given input and subtracting it from the actual value at that input; the difference is then squared. This is done for all points on the graph. Once the summation is complete, the total is divided by the number of data points on the plot to give an average value for the error.

The difference between the actual value and the predicted value is found by subtracting the value of the line of best fit at the given input, because the mean squared error’s objective is to measure the discrepancy between the dataset and the prediction. The regression line is placed such that the differences between the actual and predicted values across all 𝑥 values are as small as possible, which makes the squared differences, and hence the mean squared error, the least.

The mean squared error shows how close the scattered points lie to the regression line. The problem with computing each term of the cost function by hand is that it would take a very long time for datasets in the thousands or millions of points, as they often are; a computer, in contrast, can process that amount of data very quickly.

At this point, the mean squared error is clear: it is the average squared error between the predictions and the actual values. However, calculating the error alone is not enough, as it does not by itself return the values that form the equation of the line of best fit. We therefore need a function whose minimisation yields those values: the ‘Cost Function’.

The ‘Cost Function’

As described above, the ‘Cost Function’ is derived from the mean squared error.

(Table: Variables in the Cost Function)

Let’s start with a hypothesis:

Above, I have defined a hypothesis for the line of best fit with input x and parameters Theta1 and Theta0. These are the parameters in terms of which the ‘Cost Function’ is written.

(Fig 2: The Cost Function)
(Fig 3: Further on the Cost Function)
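
Since the figures are not reproduced here, a standard way of writing this hypothesis and cost function is shown below. The text above describes dividing by the number of pairs m; many treatments also include an extra factor of ½ so that the derivatives come out neatly, and that convention is assumed in the formulas that follow (the constant factor only rescales the learning rate, so it does not change the method):

$$h_\theta(x) = \theta_0 + \theta_1 x, \qquad J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)^2$$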

Theta1 and Theta0 are a pair of values. While programming, they are stored together in a list so that the computer can apply the updates to them efficiently and in place, without extra memory being taken up.

The ‘Cost Function’ shows how much discrepancy there is between the predicted line and the actual data points. Hence, the higher the value of the cost function, the higher the discrepancy.

Now, as the name suggests, I want to minimise the ‘Cost Function’, because I have no intention of predicting inaccurate and imprecise values for variables that may bear on significant global problems such as rising Earth surface temperatures and global warming (elaborated later in this article). Below is an outline of how to minimise the cost function so that the prediction is as accurate as possible.

Minimising the Cost Function

The global minimum of a function refers to the lowest point of the function over its defined domain of input(s). For a 2D plot, there is one input for every output; in a 3D plot, there are two inputs for every output, so the objective here is to identify the two input values that minimise the resulting surface. The cost function, as described above, is computed by taking the difference between the actual value and the predicted value for each input and squaring it. The difference is squared to eliminate negative differences, which would otherwise occur whenever the predicted value exceeds the actual value.

As its name suggests, the ‘Cost Function’ returns the mean squared error for specific values of Theta0 and Theta1. To find the best values of Theta0 and Theta1 for the dataset, the cost function must take the lowest value it can; this is the reason to optimise it. Optimising a function means finding input values at which the function is at a maximum or a minimum, depending on the scenario. In this case, the objective is the lowest possible discrepancy, so the cost function needs to converge at its global minimum.

To solve this problem, the parameter values should be those at which the ‘Cost Function’ is minimum: driving the ‘Cost Function’ to its global minimum produces the most accurate line of best fit. The process of minimising the ‘Cost Function’ is called gradient descent.

Up to this point, I have gone over the basics of the algorithm and how it tries to achieve the objective of producing an accurate line of best fit. Recall that minimising the cost function is crucial. However, the cost function has two variables. Is minimisation even possible?

Now, when dealing with a function of only one variable, regular calculus rules can be applied: differentiate the function and equate the derivative to 0 to obtain the input values that constitute the minimum(s) of the given function. However, in this scenario, a concept called partial derivatives has to be used.

Partial Derivatives

Partial derivatives are a concept in calculus that accounts for functions of more than one variable: the derivative is taken with respect to one variable at a time. The derivative is separated into two or more parts, because whenever the function is differentiated with respect to a given variable, all other variables are treated as constants.
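
As a quick illustration of my own (not from the original figures): for f(x, y) = x²y + y³, differentiating with respect to one variable while holding the other constant gives

$$\frac{\partial f}{\partial x} = 2xy, \qquad \frac{\partial f}{\partial y} = x^2 + 3y^2.$$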

Finding the minimum value of the ‘Cost Function’ and the corresponding Theta1 and Theta0 values requires the use of partial derivatives applied in a set of instructions called gradient descent.

Gradient Descent: Why?

As seen above, I will compute in the same manner, except that I will be computing the partial derivatives of 𝐽(𝜃) with respect to Theta0 and Theta1.

(The Cost Function, to be differentiated)

The derivative will be taken with respect to Theta1 and Theta0 because there are two variables in the function.

Computing the Partial Derivatives

(Fig 4: Partial Derivative with Respect to Theta0)
(Fig 5: Partial Derivative with Respect to Theta1)
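
With the ½ convention assumed earlier, these partial derivatives take the standard form below (the original figures may differ slightly in notation):

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right), \qquad \frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)x^{(i)}$$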

Using these computed partial derivatives, gradient descent can now be carried out. Gradient descent is the process by which the algorithm converges to the global minimum of the ‘Cost Function’ (see the Minimising the Cost Function section) by repeatedly updating the values of Theta0 and Theta1 by a certain amount.

Below is an overview of gradient descent and how the computer computes this process:

The updates given to Theta0 and Theta1 are derived from Euler’s method of solving differential equations. Since there are two variables, Theta0 and Theta1, to be considered, Euler’s numerical method is applied to both simultaneously but separately, meaning that the calculation for Theta0 does not directly affect Theta1 within the same step.

For Theta0:

The algorithm updates the nth value of Theta0 using the (n−1)th value of Theta0 and the gradient of the mean squared error evaluated there. This is why the developer of the algorithm has to initialise Theta0 to a random value, so that the algorithm has an (n−1)th term to reference while running.

The learning rate is multiplied by the partial derivative of the mean squared error and then subtracted from the previous value of Theta0, because the error computed by the program has to be taken into account while updating Theta0.

For Theta1:

What differs for Theta1 is the update function, which changes because of the different partial derivative. As with Theta0, the nth value of Theta1 is calculated using the (n−1)th value of Theta1; this follows the same logic as the update process for Theta0.

Hence, this is how the computer can reliably return values of Theta1 and Theta0 that constitute the minimum point of the cost function. However, a major limitation is identifying the learning rate to use.

The product of the learning rate and the gradient of the cost function at the current Theta0 and Theta1 is subtracted at each update, because the algorithm is attempting to descend towards the minimum point of the cost function.
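
Written symbolically, the updates described above take the standard gradient descent form (restated here for reference; the original shows them later in Figs 6-8):

$$\theta_0 := \theta_0 - \alpha\,\frac{\partial J}{\partial \theta_0}, \qquad \theta_1 := \theta_1 - \alpha\,\frac{\partial J}{\partial \theta_1}$$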

Learning Rate and its Relevance

𝜶 = Learning rate of the algorithm.

The learning rate determines the size of each update step, and therefore how many iterations the algorithm needs for the ‘Cost Function’ to converge at the global minimum. It is a constant, but its value for a linear regression algorithm is chosen by the developer. In short, the learning rate controls the speed at which the algorithm can minimise the cost function and output the values that form the regression line.

Now that there has been a brief introduction to the learning rate, let’s get back to observing how the Theta0 and Theta1 are simultaneously updated.

(Fig 6: Updating Theta0)
(Fig 7: Updating Theta1)

The combined updates for Theta0 and Theta1 are given below:

(Fig 8: Combined Updates for Theta0 and Theta1)
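
A minimal Python sketch of these simultaneous updates is given below. It is my own illustration, using the 1/m gradients written out earlier; the function name and initialisation are assumptions rather than the exact program from the exploration:

```python
def gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    """Fit y = theta1*x + theta0 by repeatedly applying the update rules."""
    theta0, theta1 = 0.0, 0.0   # initial values give the algorithm an (n-1)th term to start from
    m = len(xs)
    for _ in range(iterations):
        # gradients of the cost function at the current Theta0, Theta1
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # simultaneous but separate updates, each scaled by the learning rate
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Usage: fit a small sample dataset generated from y = 2x + 1
theta0, theta1 = gradient_descent([1, 2, 3, 4], [3, 5, 7, 9], alpha=0.05, iterations=5000)
print(theta0, theta1)  # approaches 1 and 2
```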

Even though gradient descent, like Euler’s numerical method on which its updates are based, can be performed manually, for large datasets the computer can perform it much faster and with higher precision. This eases data modelling for large companies and organisations.

This process repeats over all input-output pairs (𝑥, 𝑦). The variable ‘i’ indexes the ith (𝑥, 𝑦) pair, because the computer reads the 𝑥 and 𝑦 values as matrices of order n × 1, and gradient descent is an iterative algorithm. These updates are introduced to arrive at the constants of the line of best fit. Gradient descent is widely used in the machine learning industry to implement various regression tools such as polynomial regression and logistic regression; I came across this concept while taking a machine learning course online. The phrase ‘Repeat until Convergence’ indicates that the algorithm iterates over all values of (𝑥, 𝑦) until the ‘Cost Function’ converges at its global minimum.

Looking closely at the updates, it is clear that the learning rate 𝜶 has a significant influence, as it multiplies the gradient of the cost at a given point.

Changing the Learning Rate (𝜶)

If the chosen value for the learning rate is large, each step is larger, so the ‘Cost Function’ may converge in fewer steps and the algorithm appears efficient. However, choosing the learning rate too high can cause the algorithm to overshoot and diverge instead of converging to the minimum.

If the chosen value for the learning rate is too low, the number of steps for the ‘Cost Function’ to converge is high, so the algorithm is not very efficient, as it takes much longer to reach the global minimum.
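
To see both effects with the sketch above (again an illustration of mine, reusing the gradient_descent function defined earlier rather than the exact program from the exploration), the same small dataset can be fitted with several learning rates:

```python
for alpha in (0.001, 0.01, 0.1, 1.0):
    t0, t1 = gradient_descent([1, 2, 3, 4], [3, 5, 7, 9], alpha=alpha, iterations=2000)
    print(f"alpha={alpha}: theta0={t0:.3f}, theta1={t1:.3f}")
# Very small learning rates have not converged after 2000 iterations,
# while an overly large one overshoots and produces huge or NaN values.
```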

It is necessary to visualise the implications of the learning rate on sample datasets where the variables do not have a very strong correlation.

I have used a small sample dataset of 10 elements to show the effect of different learning rates.

The data in the table above clearly shows that choosing an appropriate learning rate is a crucial factor in the performance of the algorithm on a given dataset: increasing it from 0.01 to 0.1 (just a factor of 10) led to a very inaccurate line of best fit, whereas the same factor-of-10 increase from 0.001 to 0.01 did not make a large difference in the equation of the line.

By this point, I have fully worked through the linear regression algorithm that I had coded. Now I am going to evaluate it in the context of two datasets: Singapore’s population and Earth’s surface temperatures.

Linear Regression: Useful?

I had stated earlier that the linear regression algorithm can predict unknown values for different types of data such as stock prices, population, weather, etc. Below are two scenarios I have selected, which sit at opposite ends of a so-called spectrum of the applicability of linear regression. The first is the growth of Singapore’s population over 50 years. Even though population growth is generally modelled exponentially, after critically studying the dataset I felt that linear regression could also be applied, as the data show a very close linear relationship. Also, since I have recently moved from Singapore and am very fond of the place, that was another factor I took into account while deciding what data to use.

I used the Singapore population dataset from Worldometer, which has records going back to 1950. Even though the older figures may be quite uncertain due to the limited quality of measurement tools in the past, I have not adjusted the data or calculated any uncertainties that might alter what is provided in the database.

(Fig 9: Singapore Population Data, Image By Author)

The above graph was generated using the Python programming language and shows the increase in the population of Singapore over the 51 years after independence in 1965. Visually, however, the regression line does not generalise the data precisely: as seen on the graph, few of the plotted points lie close to the regression line, and the r-squared value is correspondingly low at 0.664. The r-squared value is calculated using a Python library, which allows me to identify the optimum learning rate to use so that I can achieve the highest possible r-squared value for the given dataset. For the dataset I have chosen, the program indicated that I should set the learning rate alpha to 0.1 to find the line of best fit.
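
The article does not show the exact library call; one possible way to obtain the r-squared value in Python is scipy’s linregress, sketched here with placeholder numbers rather than the real Worldometer figures:

```python
from scipy.stats import linregress

# Years since 1965 and corresponding populations (placeholder values, not the Worldometer data)
years = [0, 10, 20, 30, 40, 50]
population_millions = [1.9, 2.3, 2.7, 3.5, 4.3, 5.5]

result = linregress(years, population_millions)
print(result.slope, result.intercept)   # coefficients of the fitted line
print(result.rvalue ** 2)               # r-squared, the coefficient of determination
```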

(Equation Post Linear Regression)

To test the performance of the linear regression algorithm on a different type of dataset, I have chosen Earth’s surface temperature data, because of its tendency to increase gradually over a long period of time, in this case 50 years. Scientists around the world use linear regression to predict the expected surface temperatures of the Earth in the near future; this knowledge can help humans prepare for the worst-case scenario.

To validate whether linear regression can be useful for predicting future Earth temperatures, I have obtained the change in Earth’s surface temperature since 1965. I will apply linear regression to the data from 1965 to 2014 to produce a regression line, and then use the change in surface temperature for 2015 to validate it.

(Fig 10: Earth Surface Temperature Data, Image By Author)

As shown in the above diagram, linear regression has performed quite well on this dataset. This is evident because the points are balanced on either side of the line, and it has returned a reasonably high r-squared value of 0.845, which is notable given the many factors affecting this type of data and the large uncertainty in its collection and processing. This suggests that it can predict future data with low error.

(Equation Post Linear Regression)

Did it Work?

Looking at the Earth Surface Temperature:

To ensure that the regression line obtained using the algorithm is accurate and precise, I will use the value of growth in 2015 as the target.

Although there is an error between the predicted and actual value, it is very small, which shows that linear regression has proven very useful in this scenario for predicting temperature changes of Earth’s surface. It also indicates that predictive analysis using linear regression is best applied in scenarios where many variables are controlled and the variables used to identify the correlation are continuous.

Below is the study of Singapore’s population over 50 years, showing how linear regression may harm decisions through incorrect predictions:

As seen from the graph drawn earlier, the coefficient of determination is low, which indicates that the linear correlation is weak. Even though 0.664 may seem high, given that the r-squared value ranges from 0 to 1, predictions made using this line of best fit will carry considerable error and will not be very accurate. To show this, I intentionally excluded the years 2016, 2017 and 2018 from my dataset so that I could compare the predictions to the actual population values.

Using 2016, 2017 and 2018 as testing values for the performance of this algorithm

As seen in the table above, when linear regression is applied to this dataset it fails to perform well, predicting very inaccurate quantities based on the previous 50 years. Linear regression is therefore not a good fit for population data. This is a limitation of linear regression: it is only useful in a limited set of situations, even if it is a very widely used tool for predicting future circumstances.

Both datasets above have been fitted using the machine learning (gradient descent) linear regression algorithm. However, that algorithm is not the only method by which a regression line can be calculated. Below are calculations on the same dataset using the classical least-squares regression method.

(Sample Dataset)

In theory, the least-squares regression method is as follows:

𝑐 can be calculated by substituting the point (𝑥mean, 𝑦mean) into 𝑦 = 𝑚𝑥 + 𝑐, because the least-squares line of best fit is guaranteed to pass through (𝑥mean, 𝑦mean).
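
For completeness (the original shows this as a figure; this is the standard textbook form), the slope and intercept of the least-squares line are:

$$m = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \qquad c = \bar{y} - m\,\bar{x}$$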

In this case,

By comparison, the least-squares regression method generates a very similar regression line to the ML algorithm, indicating that the ML algorithm has performed well on the given dataset. It also shows that the residual error after training is very small, as the coefficient values are very similar. This supports the validity of the ML algorithm in predicting future values for different types of datasets.
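
As one way to check this kind of agreement in code (an illustration of mine, reusing the gradient_descent sketch from earlier and sample numbers rather than the actual dataset), the closed-form fit from numpy can be compared against the iterative result:

```python
import numpy as np

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.2, 5.9, 8.1, 9.8]   # placeholder sample data, roughly y = 2x

# Closed-form least squares: numpy fits slope and intercept directly
slope, intercept = np.polyfit(xs, ys, deg=1)

# Iterative gradient descent, using the sketch defined earlier
theta0, theta1 = gradient_descent(xs, ys, alpha=0.05, iterations=10000)

print(f"least squares:    m={slope:.4f}, c={intercept:.4f}")
print(f"gradient descent: m={theta1:.4f}, c={theta0:.4f}")
# With a well-chosen learning rate the two pairs of coefficients are nearly identical.
```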

Conclusion

To conclude, linear regression is very useful but suits only a specific type of problem, such as body mass index datasets, stock exchange datasets, etc. Firstly, linear regression does not perform well with population data, so it is safe to conclude that linear regression may perform poorly on data with a changing growth or decline rate. As an alternative, an exponential curve can be used to model Singapore’s population data, because an exponential curve better represents this kind of growth; one possible way of doing this is sketched below.
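
One possible way to fit such an exponential curve (not part of the original exploration; placeholder numbers are used) is to fit a straight line to the logarithm of the population, since y = A·e^(kt) becomes linear in log space:

```python
import numpy as np
from scipy.stats import linregress

years = np.array([0, 10, 20, 30, 40, 50])                    # years since 1965 (placeholders)
population = np.array([1.9e6, 2.3e6, 2.7e6, 3.5e6, 4.3e6, 5.5e6])

# Fitting log(population) against time turns exponential growth into a straight line
result = linregress(years, np.log(population))
k, log_A = result.slope, result.intercept
print(f"growth rate k = {k:.4f} per year, A = {np.exp(log_A):.0f}")
print(f"r-squared in log space: {result.rvalue ** 2:.3f}")
```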

Secondly, even though the program was able to choose a learning rate for me, that will not be the case in every situation. This indicates another limitation of the linear regression algorithm: it can be very difficult to identify how quickly the cost function should be made to converge.

Thirdly, linear regression applied to non-linear relationships can produce very high errors, and since the world today is already battling uncertainty, the tool should not create further problems through incorrect and inaccurate predictions.

Lastly, recalling the initial aim, the linear regression algorithm behaves very differently depending on the dataset, the learning rate and the parameters of the cost function. Depending on the scenario, it can perform either extremely well or extremely poorly.

Thank you for reading my exploration of the mathematics behind linear regression!

Some datasets and concepts are not mine, so I have listed a few credits at the end. Also, I am open to any criticism/feedback you would wish to provide, whether that is in the comments or through email.

Email: rahulr280623@gmail.com; I will try my best to respond promptly.

Credits

Brownlee, Jason. “Linear Regression Tutorial Using Gradient Descent for Machine Learning.” Machine Learning Mastery, 12 Aug. 2019, machinelearningmastery.com/linear-regression-tutorial-using-gradient-descent-for-machine-learning/.

Gandhi, Rohith. “Introduction to Machine Learning Algorithms: Linear Regression.” Medium, Towards Data Science, 28 May 2018, towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a.

Negi, Sameer. “Linear Regression — Detailed Overview.” Medium, Towards Data Science, 21 Sept. 2018, towardsdatascience.com/linear-regression-cost-function-gradient-descent-normal-equations-1d2a6c878e2c.

UN. “Singapore Population (LIVE).” Worldometer, 2020, www.worldometers.info/world-population/singapore-population/.

Anuveyatsu. “Global Temperature.” Global, 2018.


Hi, I’m Rahul, constantly trying to learn something in the mathematics/computer science space.