## Intro

The method of *least squares* is the workhorse of many applications of linear algebra in statistics.

Ever since Adrien-Marie Legendre published the method in 1805, it has proven to be an extremely powerful tool for analysis.

But is Legendre really the person to take credit for the development of the method?

Shortly after his publication, another well-known mathematician, Carl Friedrich Gauss, claimed that he had in fact been using the method since 1795.

To this day, there is debate within the mathematical community over whether it was Legendre or Gauss who first developed the method.

The fact that the proponents of both sides care so deeply about crediting their candidate hints at the importance of the method.

## Concept

When trying to fit a line to some experimental data, it is generally not possible to make it go through each individual point. Instead, we will draw a line that is a good compromise for all of them.

To define a criterion for how well a line accounts for all the points, we can choose to minimize the sum of the squared vertical distances from the points to the line.

This results in a relatively easy way of determining a line that cuts right through the middle of the data points.

The task becomes an optimization problem, which is a field of its own in mathematics. This particular one is solved using the least squares method and, as long as at least two of the points have distinct $x$-coordinates, it has a unique solution.

## Math

The points and their corresponding line can be represented by a system of linear equations in matrix form as:

$$A\mathbf{x} = \mathbf{b}$$

where $A$ and $\mathbf{b}$ are constructed from the data points, and $\mathbf{x}$ contains the unknown parameters of the line ($k$ and $m$).

The problem here is that we tend to have more data points than parameters, leading to a system without a solution.

This is where the least squares method comes in handy. By left-multiplying both sides of the equation by $A^T$, we obtain a square system:

$$A^T A\mathbf{x} = A^T\mathbf{b}$$

This equation has a unique solution

$$\hat{\mathbf{x}} = (A^T A)^{-1} A^T\mathbf{b}$$

containing the optimal parameters $k$ and $m$ that minimize the least squares criterion.
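As a concrete sketch of this recipe, the snippet below builds $A$ and $\mathbf{b}$ from a handful of invented data points and solves the normal equation with NumPy (the numbers are made up purely for illustration):

```python
import numpy as np

# Invented data points (x_i, y_i) to fit with a line y = k*x + m.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Each row of A is [x_i, 1]; b holds the y_i.
A = np.column_stack([x, np.ones_like(x)])
b = y

# Normal equation: A^T A x = A^T b, a 2x2 square system with a unique solution.
k, m = np.linalg.solve(A.T @ A, A.T @ b)
```

Here `np.linalg.solve` handles the small square system directly; for large or ill-conditioned problems a dedicated least-squares routine is preferable.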

## Least squares

The least squares method is frequently used in advanced STEM courses. The method fits a line to measurement data in order to explain the relationship underlying a phenomenon. The phenomenon can be anything from a controlled physical experiment to a series of observations of the real world, e.g. in psychology or economics. Students with career ambitions as analysts or data scientists here get their first tool for modelling.

All models are wrong, but some are useful.

Say we have measurement data in the format $(\mathbf{x}_i, y_i)$, that is, for each observation $i$ we have a set of variables $\mathbf{x}_i$ that we want to relate to the corresponding value $y_i$. We would like to express this relationship as a function $f$ that best explains the relationship between $\mathbf{x}_i$ and $y_i$. We can never demand that our model satisfies $f(\mathbf{x}_i) = y_i$ for every $i$, because all models are wrong, but some are useful. Therefore, we use the approximation sign in the following way:

$$f(\mathbf{x}_i) \approx y_i$$

If $f$ is a linear mapping, we can derive the following system from the above equations:

$$\begin{cases} f(\mathbf{x}_1) = y_1 \\ f(\mathbf{x}_2) = y_2 \\ \;\;\vdots \\ f(\mathbf{x}_n) = y_n \end{cases}$$

This system of equations is in practice overdetermined: the unknowns are the constants of $f$, which are usually only a handful in number, while the number of rows $n$ can be in the hundreds, thousands or even millions (think of the amounts of data Google and Facebook work with). Thus, there is in general no exact solution to this system. Instead, we look for the values of the constants that optimally fit the function to the measurement data.

The least squares method minimizes the sum of squared distances between the points and the line.

A mathematical definition of an "optimally fitted" function is one whose constants generate the least total deviation, or "error", from the measurement data. We define the error of a single observation as:

$$e_i = \left(f(\mathbf{x}_i) - y_i\right)^2$$

which we recognize as the squared distance between $f(\mathbf{x}_i)$ and $y_i$. We sum this error over all observations to get the total error. Briefly:

$$E = \sum_{i=1}^{n} e_i = \sum_{i=1}^{n} \left(f(\mathbf{x}_i) - y_i\right)^2$$

We want to find the function $f$ that minimizes the total error $E$.
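As a sketch of this error criterion, the snippet below (with invented measurements) evaluates $E$ for an arbitrary candidate line and compares it with the least-squares fit, which by construction can never do worse:

```python
import numpy as np

# Invented measurements (x_i, y_i).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

def total_error(k, m):
    """Total error E: sum of squared deviations (f(x_i) - y_i)^2 for f(x) = k*x + m."""
    return float(np.sum((k * x + m - y) ** 2))

# Least-squares fit for comparison.
A = np.column_stack([x, np.ones_like(x)])
k_best, m_best = np.linalg.lstsq(A, y, rcond=None)[0]

# The fitted line never has a larger total error than a guessed line.
assert total_error(k_best, m_best) <= total_error(2.0, 0.5)
```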

We define the total error as the sum of all squared distances between the points and the line.

We said that the least squares method minimizes the sum of squared distances between the points and the line. We can rewrite the system of equations as the famous equation:

$$A\mathbf{x} = \mathbf{b}$$

where the constants we want to find are collected in the variable $\mathbf{x}$ (the conventional notation for the unknown) and the right-hand side is written as $\mathbf{b}$ (conventionally, what is known).

This is an optimization problem, and optimization is a branch of mathematics in its own right. However, this is a very simple optimization problem because the solution is *unique* and *easy to calculate*. Without proof or justification, we now show the calculation. We multiply by $A^T$ on both sides from the left:

$$A^T A\mathbf{x} = A^T\mathbf{b}$$

The equation $A^T A\mathbf{x} = A^T\mathbf{b}$ is called *the normal equation* and is a square system whose unique solution gives the values of the constants that minimize the total squared distance between the points and the line. This elegant method boils down to the following expression:

$$\hat{\mathbf{x}} = (A^T A)^{-1} A^T\mathbf{b}$$
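The closed-form expression can be checked numerically. Below, a sketch with made-up data compares it against NumPy's built-in least-squares routine, which minimizes the same criterion in a numerically more stable way:

```python
import numpy as np

# Made-up overdetermined system: 5 equations, 2 unknowns.
A = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0],
              [5.0, 1.0]])
b = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form solution of the normal equation: xhat = (A^T A)^{-1} A^T b.
xhat_normal = np.linalg.inv(A.T @ A) @ A.T @ b

# Library least-squares routine; solves the same minimization via SVD.
xhat_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Both approaches agree on well-conditioned data.
assert np.allclose(xhat_normal, xhat_lstsq)
```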

### Example 1. Straight line

Let's say we have measurement data on people's heights and shoe sizes in the format:

$$(x_i, y_i), \quad i = 1, \ldots, n$$

where $x_i$ is shoe size and $y_i$ is height. Plotted, the measurement data appears to follow a linear relationship, which is logical: bigger feet are usually found on taller people, and vice versa. This justifies fitting a straight line to the measurement data. Namely, for some values of the constants $k$ and $m$, we should be able to get a line that explains the relation:

$$y = kx + m$$

We have $n$ measurement points, so we can set this up as a linear system of equations with $n$ rows:

$$\begin{cases} kx_1 + m = y_1 \\ kx_2 + m = y_2 \\ \;\;\vdots \\ kx_n + m = y_n \end{cases}$$

We can write this as an augmented matrix:

$$\left[\begin{array}{cc|c} x_1 & 1 & y_1 \\ x_2 & 1 & y_2 \\ \vdots & \vdots & \vdots \\ x_n & 1 & y_n \end{array}\right]$$

This system is called *overdetermined* because there are more equations than unknowns, or equivalently, more rows than columns. (Note: an *underdetermined* system is the opposite scenario, that is, fewer equations than unknowns.) Writing the system as $A\mathbf{x} = \mathbf{b}$, with $\mathbf{x} = (k, m)^T$, we multiply by $A^T$ from the left on both sides:

$$A^T A\mathbf{x} = A^T\mathbf{b}$$

This system has a unique solution, namely the values of $k$ and $m$ that give the best-fit line for the measurement data.
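The steps above can be carried out in a few lines of code. The shoe sizes and heights below are invented for the example, not real measurements:

```python
import numpy as np

# Invented measurements: shoe size x_i (EU) and height y_i (cm).
shoe = np.array([38.0, 40.0, 42.0, 44.0, 46.0])
height = np.array([164.0, 171.0, 177.0, 185.0, 191.0])

# Each equation k*x_i + m = y_i contributes the row [x_i, 1] of A.
A = np.column_stack([shoe, np.ones_like(shoe)])

# Normal equation A^T A [k, m]^T = A^T y has a unique solution.
k, m = np.linalg.solve(A.T @ A, A.T @ height)
```

The slope `k` is then the estimated height gain in centimetres per shoe size, and `m` the intercept of the fitted line.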

### Example 2. Second degree equation

Much in nature and reality is not linear. Some phenomena first show an increasing effect up to a maximum, followed by a decreasing effect. One classic example comes from business. To increase the revenue of a business, you can raise the price. As the price rises, you can expect to lose some customers, i.e. you get fewer customers who each pay more. Total revenue nevertheless increases up to a certain point. At the breaking point, the higher price per paying customer no longer compensates for the loss of customers, and the total turnover starts to decrease again.

Let the measurement data be in the format:

$$(x_i, y_i), \quad i = 1, \ldots, n$$

where $x_i$ is the price of the product or service and $y_i$ is the total turnover. We fit the following curve to the measurement data:

$$y = ax^2 + bx + c$$

and we get the following system, where each row corresponds to the fit of one measured point:

$$\begin{cases} ax_1^2 + bx_1 + c = y_1 \\ ax_2^2 + bx_2 + c = y_2 \\ \;\;\vdots \\ ax_n^2 + bx_n + c = y_n \end{cases}$$

With $A$ built from the rows $[x_i^2 \;\; x_i \;\; 1]$, $\mathbf{x} = (a, b, c)^T$ and $\mathbf{b}$ the column of the $y_i$, we convert to the normal equation:

$$A^T A\mathbf{x} = A^T\mathbf{b}$$

which has a unique solution for the values of the parameters $a$, $b$ and $c$ that give the best-fit quadratic curve for our measurement data.
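The same recipe works in code with one extra column in $A$. For checkability, the turnover values below are generated from an exact quadratic rather than taken from real data, so the fit should recover the coefficients we put in:

```python
import numpy as np

# Invented data: price x_i and total turnover y_i, generated from the
# exact quadratic y = -x^2 + 10x so the fitted coefficients can be checked.
price = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
turnover = -price**2 + 10 * price

# Each row of A is [x_i^2, x_i, 1] for the model y = a*x^2 + b*x + c.
A = np.column_stack([price**2, price, np.ones_like(price)])

# Solve the normal equation A^T A [a, b, c]^T = A^T y.
a, b, c = np.linalg.solve(A.T @ A, A.T @ turnover)
```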