Least Squares Linear Regression Analysis

C++ code: regression.cpp

Compile it like so: g++ -O3 regression.cpp -o regression

Sample test file: data.txt:

    3    1    4    3147
    1    5    9    1597
    2    6    5    2657
    3    5    8    3587
    9    7    9    9797
    3    2    3    3237
  
The input is one data point per line, where each line is n tab-separated or comma-separated numbers.

Run it like so: regression < data.txt . The output for that input is: 1000 100 10 7 . Whether the output is separated by tabs or commas matches the input.

I doubt there is anything even vaguely original here. This is least-squares linear regression analysis, as found by Legendre and Gauss in the early 1800's.

The program takes the first n-1 columns of the input and finds the linear combination of them that best approximates the nth column, by minimizing sum(over all data points)(c1*col1 + c2*col2 + ... + cn-1*coln-1 + cn - coln)^2. Minimizing that formula seems very ugly. There could be thousands of data points, and the formula has squares; it probably involves repeated approximations and maybe local minima. But no.

First, expand the algebra. Assuming 2 columns, sum(ax + b - y)^2 is what we want to minimize. Looks ugly. Expanding the algebra we get sum(aaxx + 2axb - 2axy + bb*1 - 2by + yy), which still looks just as ugly. But you can collect terms. Define XX := sum(xx), XY := sum(xy), X := sum(x), Y := sum(y), YY := sum(yy). Then, if there are P data points, it becomes (aaXX + 2abX - 2aXY + bbP - 2bY + YY). No further sums, just that one (still ugly) equation.

But the real trick is that the equation is minimized where its derivative is zero. The derivative with respect to a is (2aXX + 2bX - 2XY = 0) and with respect to b is (2aX + 2bP - 2Y = 0). All the squares disappeared! Two linear equations, two unknowns (a and b)! Apply Gaussian Elimination and you're done.
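
To make that concrete, here is a minimal sketch of the two-column case (this is not regression.cpp itself, and it assumes whitespace-separated input): accumulate the five sums while reading, then solve the two linear equations directly.

    #include <cstdio>

    int main() {
        double x, y;
        double P = 0, X = 0, Y = 0, XX = 0, XY = 0;    // the collected sums
        while (std::scanf("%lf %lf", &x, &y) == 2) {   // one data point per line
            P += 1; X += x; Y += y; XX += x * x; XY += x * y;
        }
        // The two equations:  a*XX + b*X = XY   and   a*X + b*P = Y.
        // A 2x2 system, solved here by Cramer's rule rather than general elimination.
        double det = XX * P - X * X;
        double a = (XY * P - X * Y) / det;
        double b = (XX * Y - X * XY) / det;
        std::printf("%g %g\n", a, b);
        return 0;
    }

Feed it points that lie exactly on y = 2x + 3 and it prints 2 3.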

For sum(ax + by + cz + d - v)^2, the equations look like

aXX + bXY + cXZ + dX = XV
aXY + bYY + cYZ + dY = YV
aXZ + bYZ + cZZ + dZ = ZV
aX + bY + cZ + dP = V
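
The pattern generalizes nicely if you treat the constant term as one more column that is always 1: the coefficient of the j-th unknown in the i-th equation is the sum, over all data points, of predictor i times predictor j (the constant's "predictor" being just 1), and the right-hand side of equation i is the sum of predictor i times the target column. Here is a sketch of that matrix-building step (the names and layout are my guesses, not code from regression.cpp):

    #include <cstddef>
    #include <vector>

    // Each row of `data` is one input line: the first n-1 numbers are the
    // predictor columns, the last one is the column being approximated.
    // Treating the constant term as an extra predictor that is always 1 gives
    // n unknowns, so the result is an n x (n+1) augmented matrix of sums,
    // ready for Gaussian Elimination.
    std::vector<std::vector<double>> normalEquations(
            const std::vector<std::vector<double>>& data) {
        const std::size_t n = data[0].size();
        std::vector<std::vector<double>> m(n, std::vector<double>(n + 1, 0.0));
        for (const auto& row : data) {
            std::vector<double> p(row.begin(), row.end() - 1);   // predictors
            p.push_back(1.0);                                    // constant-term column
            for (std::size_t i = 0; i < n; ++i) {
                for (std::size_t j = 0; j < n; ++j) m[i][j] += p[i] * p[j];
                m[i][n] += p[i] * row.back();                    // right-hand side
            }
        }
        return m;
    }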

So, this program just counts how many columns and data points there are, does the appropriate sums, forms the appropriate matrix, applies Gaussian Elimination, and tells you what the coefficients of the linear equation ought to be.
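
Here is what the Gaussian Elimination step might look like (again my reconstruction, not the code in regression.cpp): take the augmented matrix of sums built as above, reduce it to triangular form with partial pivoting, and back-substitute. The whole program is essentially read-lines, build the matrix, solve, print.

    #include <cmath>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Solves the n x (n+1) augmented system and returns the n coefficients.
    // No attempt is made to handle singular (or nearly singular) matrices.
    std::vector<double> solve(std::vector<std::vector<double>> m) {
        const std::size_t n = m.size();
        for (std::size_t col = 0; col < n; ++col) {
            // Partial pivoting: move the largest remaining entry onto the diagonal.
            std::size_t pivot = col;
            for (std::size_t r = col + 1; r < n; ++r)
                if (std::fabs(m[r][col]) > std::fabs(m[pivot][col])) pivot = r;
            std::swap(m[col], m[pivot]);
            // Eliminate everything below the pivot.
            for (std::size_t r = col + 1; r < n; ++r) {
                const double f = m[r][col] / m[col][col];
                for (std::size_t j = col; j <= n; ++j) m[r][j] -= f * m[col][j];
            }
        }
        // Back substitution, bottom row up.
        std::vector<double> coef(n);
        for (std::size_t i = n; i-- > 0; ) {
            double s = m[i][n];
            for (std::size_t j = i + 1; j < n; ++j) s -= m[i][j] * coef[j];
            coef[i] = s / m[i][i];
        }
        return coef;
    }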

What if you want to find the best a,b,c for ax^2+bx+c = y? Sounds really ugly. But it turns out that's simple too. In fact it's the same program. Feed in x^2,x,y, and it solves for a,b,c. You can find the linear coefficients for the best approximation by any combination of functions of x that way: feed in f(x),g(x),h(x),y.
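
For example, to fit a parabola to a pile of (x, y) points, you only need to transform the input first. A throwaway filter like this (my own sketch, assuming whitespace-separated x y pairs) turns them into the x^2, x, y columns:

    #include <cstdio>

    int main() {
        double x, y;
        while (std::scanf("%lf %lf", &x, &y) == 2)
            std::printf("%g\t%g\t%g\n", x * x, x, y);   // emit x^2, x, y
        return 0;
    }

Pipe its output into regression and the three numbers that come back are a, b, c for ax^2 + bx + c.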


