Why Least Squares?

Rod and Pegboard

Suppose you have a bunch of pegs scattered around on a wall, like this:

You see a general trend, and you want to take a rod and use it to go through the points in the best possible way, like this:

How do you decide which way is best? Here is a physical solution.

To each peg you attach a spring of zero rest length. You then attach the other end of the spring to the rod. Make sure the springs are all constrained to be vertical.

Now let the rod go. If most of the points are below it, the springs on the bottom will be longer, exert more force, and pull the rod down. Similarly, if the rod’s slope is shallower than that of the trend in the points, the rod will be torqued up to a steeper slope. The final resting place of the rod is one sort of estimate of the best straight-line approximation of the pegs.

To see mathematically what this system does, remember that the energy stored in a spring of zero rest length is proportional to the square of its length. The system settles into a stable static equilibrium, so it sits at a minimum of potential energy. Thus, this best-fit line is the line that minimizes the sum of the squares of the lengths of the springs, that is, the sum of the squared residuals, as they’re called.
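
To make the spring picture concrete, here is a minimal numerical sketch (not from the original post) that lets the rod settle by minimizing the total spring energy with scipy and then compares the resting position with numpy’s standard least-squares fit; the pegs are made-up data for illustration.

    import numpy as np
    from scipy.optimize import minimize

    # Made-up pegs scattered around a rough linear trend.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 20)
    y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

    def spring_energy(params):
        """Total potential energy of vertical zero-rest-length springs
        joining each peg (x_i, y_i) to the rod y = m*x + b."""
        m, b = params
        lengths = y - (m * x + b)      # vertical spring lengths (the residuals)
        return np.sum(lengths ** 2)    # energy is proportional to length squared

    # Let the rod go: find the (m, b) that minimizes the potential energy.
    rest = minimize(spring_energy, x0=[0.0, 0.0])

    # The same line from the usual least-squares routine.
    m_ls, b_ls = np.polyfit(x, y, deg=1)

    print(rest.x)       # (m, b) at the energy minimum
    print(m_ls, b_ls)   # should agree to numerical precision

The equilibrium found by the optimizer and the least-squares coefficients should coincide up to numerical tolerance, which is exactly the claim above.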

This picture lets us find a formula for the least-squares line. To be in equilibrium, the rod must feel no net force. The force exerted by a spring is proportional to its length, so the lengths of all the springs must add to zero. (We count a spring’s length as positive if its peg is above the rod and negative otherwise.)

Mathematically, we’ll write the points as (x_i, y_i) and the line as y = mx+b. Then the no-net-force condition is written

\sum_i \left(y_i - (mx_i+b)\right) = 0

There must also be no net torque on the rod. The torque exerted by a spring about the origin is proportional to its length multiplied by its lever arm x_i. That means

\sum_i x_i \left(y_i - (mx_i + b)\right) = 0

These two equations determine the unknowns m and b. The reader will undoubtedly be unable to stop themselves from completing the algebra, finding that if there are N data points

m = \frac{\frac{1}{N}\sum_i x_iy_i - \frac{1}{N}\sum_i y_i \frac{1}{N}\sum_i x_i}{\frac{1}{N} \sum_i x_i^2 - (\frac{1}{N}\sum_i x_i)^2}

b = \frac{1}{N}\sum_i y_i - m \frac{1}{N} \sum_i x_i
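
As a quick check on that algebra, here is a small sketch (with hypothetical data, not from the original post) that evaluates these formulas directly and compares them with numpy’s built-in fit.

    import numpy as np

    # Hypothetical data points (x_i, y_i).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    N = len(x)

    # Slope and intercept from the no-net-force / no-net-torque conditions.
    m = ((np.sum(x * y) / N - (np.sum(y) / N) * (np.sum(x) / N))
         / (np.sum(x ** 2) / N - (np.sum(x) / N) ** 2))
    b = np.sum(y) / N - m * np.sum(x) / N

    print(m, b)
    print(np.polyfit(x, y, deg=1))   # the same line, from numpy's least-squares fit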

These formulas clearly contain some averages. Let’s denote \frac{1}{N}\sum_i x_i = \langle x \rangle and similarly for y and combinations of the two. Then we can rewrite the formulae as

m = \frac{\langle xy\rangle - \langle x \rangle \langle y\rangle }{\langle x^2\rangle - \langle x\rangle ^2}

\langle y \rangle = m \langle x \rangle + b

This is called a least-squares linear regression.

Variance

The story about the rod and minimizing potential energy is not really the reason we use least-squares regression; it was only a convenient illustration. Students are often curious why we do not, for example, minimize the sum of the absolute values of the residuals.

Take a look at the value \langle x^2\rangle - \langle x\rangle ^2 from the expression for the least-squares regression. This is called the variance of x. It’s a very natural measure of the spread of x, more natural than the one you’d get by adding up the absolute values of the deviations from the mean.

Suppose you have two variables, x and u. Then

\mathrm{var}(x+u) = \mathrm{var}(x) + \mathrm{var}(u) + 2\langle xu\rangle - 2\langle x \rangle \langle u \rangle

The reader is no doubt currently wearing a pencil down to the nub showing this.

If x and u are independent, the last two terms cancel (down to the nub!), and we have

\mathrm{var}(x+u) = \mathrm{var}(x) + \mathrm{var}(u)

In practical terms: flip a coin once and the number of heads has a variance of 0.25. Flip it a hundred times and the variance is 25, etc. This additivity does not hold for absolute values.
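
A quick simulation (a sketch, not from the original post) makes the coin-flip numbers concrete: the count of heads in 100 flips is a sum of 100 independent single flips, so its variance is 100 times larger.

    import numpy as np

    rng = np.random.default_rng(1)

    # Number of heads in a single fair flip: variance should be about 0.25.
    one_flip = rng.binomial(1, 0.5, size=200_000)
    print(one_flip.var())       # ~0.25

    # Number of heads in 100 flips: a sum of 100 independent flips,
    # so the variances add and the result should be about 25.
    hundred_flips = rng.binomial(100, 0.5, size=200_000)
    print(hundred_flips.var())  # ~25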

So variance is a very natural measure of variation. Simple linear regression is nice, then, because it

  1. makes the mean residual zero
  2. minimizes the variance of the residuals

Defining the covariance as a generalization of the variance,

\mathrm{cov}(x,y) \equiv \langle xy\rangle - \langle x\rangle \langle y\rangle

(so that \mathrm{var}(x) = \mathrm{cov}(x,x)), we can rewrite the slope m in the least-squares formula as
m = \frac{\mathrm{cov}(x,y)}{\mathrm{var}(x)}
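
In code this is one line; the sketch below (hypothetical data, not from the original post) uses numpy’s covariance with bias=True so that both the covariance and the variance are normalized by N, matching the plain averages used above.

    import numpy as np

    # Hypothetical data points (x_i, y_i).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Slope as cov(x, y) / var(x); both normalized by N (population form).
    m = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    b = y.mean() - m * x.mean()

    print(m, b)   # same slope and intercept as the formulas above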

The Distance Formula

The distance d of a point (x,y) from the origin is

d^2 = x^2 + y^2

In three dimensions, this becomes

d^2 = x^2 + y^2 + z^2

The generalization to n dimensions is clear.

If we imagine the residuals as the coordinates of a point in n-dimensional space, then simple linear regression finds the line that brings that point as close to the origin as possible, which is another cute visualization.
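
As a sketch of that geometric picture (again with made-up data), the residual vector of the least-squares line has a smaller Euclidean norm, meaning its point sits closer to the origin, than that of any perturbed line.

    import numpy as np

    # Made-up data points (x_i, y_i).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    def residual_distance(m, b):
        """Norm of the residual vector, viewed as a point in n-dimensional space."""
        return np.linalg.norm(y - (m * x + b))

    m_ls, b_ls = np.polyfit(x, y, deg=1)
    print(residual_distance(m_ls, b_ls))         # the least-squares line
    print(residual_distance(m_ls + 0.1, b_ls))   # any other line lands farther out
    print(residual_distance(m_ls, b_ls + 0.5))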

Further Reading

The physical analogy to springs and minimum energy comes from Mark Levi’s book The Mathematical Mechanic (Amazon, Google Books).

The Wikipedia articles on linear regression and simple linear regression are good.

There’s much mathematical insight to be had at Math.StackExchange, Stats.StackExchange, and MathOverflow.
