Least squares is a basic way to fit a curve to data. Our treatment will follow TAK § 4.5.
Least squares gives a method of fitting a curve to discrete data points. Usually we want to approximate by a polynomial of some given degree, though other approximations are also possible.
It also gives a method of approximating a continuous curve by a polynomial (or a trigonometric function). TAK § 4.5 gives more detail on using the $L_2$-norm to find a polynomial curve close to a non-polynomial curve.
# Standard Imports
import numpy as np
import matplotlib.pyplot as plt
import math as m
from mpmath import mp, iv
from scipy import linalg
On April 1, 2020, the MTA's Introduction to Subway Ridership page gave the following data for total rides over the previous several years.
| Year | Annual Total  |
|------|---------------|
| 2013 | 1,707,555,714 |
| 2014 | 1,751,287,621 |
| 2015 | 1,762,565,419 |
| 2016 | 1,756,814,800 |
| 2017 | 1,727,366,607 |
| 2018 | 1,680,060,402 |
We might be interested in fitting a curve to this data. For example, we might be interested in using the curve to guess what next year's total would be.
The following gives the data as a list of pairs `(year, total)`.
ridership_data = [(2013, 1707555714),
                  (2014, 1751287621),
                  (2015, 1762565419),
                  (2016, 1756814800),
                  (2017, 1727366607),
                  (2018, 1680060402)]
Here we do some work to plot these data points. We can organize the years we have data for in a list we call `years` with the following.
years = []
for year,total in ridership_data:
    years.append(year)
years
We can use a Python list comprehension to construct `years` in a more compact way:
years = [year for year,total in ridership_data]
We similarly organize the list of total rides.
totals = [total for year,total in ridership_data]
totals
plt.plot(years,totals,"o")
plt.title("Total ridership each year")
plt.xlabel("Year")
plt.ylabel("Annual total rides")
plt.show()
Surprisingly, this looks almost like a parabola! So it seems reasonable to try to approximate the data by a degree two polynomial: $$F(Y)=a + bY + cY^2.$$ Here $Y$ represents a year, and we are hoping that $F(Y)$ is very close to the ridership that year. We are looking for a "best" choice of $F$, using the least squares (or $L^2$) notion of best: we look for the function $F$ which makes the quantity $$\sum_{Y=2013}^{2018} \big(F(Y)-T_Y\big)^2$$ smallest, where $T_Y$ is the total riders in year $Y$.

Expanding our definition of $F$, we can write the function we are hoping to minimize as $$M(a,b,c) = \sum_{Y=2013}^{2018} \big(a + bY + cY^2-T_Y\big)^2.$$ We know from calculus that at the minimizer we would have $$\frac{\partial M}{\partial a} = \frac{\partial M}{\partial b} = \frac{\partial M}{\partial c} = 0.$$ We compute $$ \begin{array}{rcl} \frac{\partial M}{\partial a}(a,b,c) & = & 2 \sum_{Y=2013}^{2018} \big(a + bY + cY^2-T_Y\big),\\ \frac{\partial M}{\partial b}(a,b,c) & = & 2 \sum_{Y=2013}^{2018} \big(a + bY + cY^2-T_Y\big) Y, \quad \text{and}\\ \frac{\partial M}{\partial c}(a,b,c) & = & 2 \sum_{Y=2013}^{2018} \big(a + bY + cY^2-T_Y\big) Y^2.\end{array}$$

Setting each of these equations equal to zero, dividing by $2$, and moving the terms that do not depend on $a$, $b$ or $c$ to the other side yields three linear equations: $$\begin{array}{rrrrrrr} a \sum_{Y=2013}^{2018} 1 & + & b \sum_{Y=2013}^{2018} Y & + & c \sum_{Y=2013}^{2018} Y^2 & = & \sum_{Y=2013}^{2018} T_Y, \\ a \sum_{Y=2013}^{2018} Y & + & b \sum_{Y=2013}^{2018} Y^2 & + & c \sum_{Y=2013}^{2018} Y^3 & = & \sum_{Y=2013}^{2018} Y T_Y, \\ a \sum_{Y=2013}^{2018} Y^2 & + & b \sum_{Y=2013}^{2018} Y^3 & + & c \sum_{Y=2013}^{2018} Y^4 & = & \sum_{Y=2013}^{2018} Y^2 T_Y. \end{array}$$

We can write this as a matrix equation: $$ \left(\begin{matrix} \sum_{Y=2013}^{2018} 1 & \sum_{Y=2013}^{2018} Y & \sum_{Y=2013}^{2018} Y^2 \\ \sum_{Y=2013}^{2018} Y & \sum_{Y=2013}^{2018} Y^2 & \sum_{Y=2013}^{2018} Y^3 \\ \sum_{Y=2013}^{2018} Y^2 & \sum_{Y=2013}^{2018} Y^3 & \sum_{Y=2013}^{2018} Y^4 \\ \end{matrix}\right) \left(\begin{matrix} a \\ b \\ c \end{matrix}\right)= \left(\begin{matrix} \sum_{Y=2013}^{2018} T_Y \\ \sum_{Y=2013}^{2018} Y T_Y \\ \sum_{Y=2013}^{2018} Y^2 T_Y \end{matrix}\right). $$
M = np.zeros((3,3), dtype=int)
M
for i in range(3):
    for j in range(3):
        for year in years:
            M[i,j] += year**(i+j)
M
for i in range(3):
    for j in range(3):
        M[i,j] = sum([year**(i+j) for year,total in ridership_data])
M
r = np.zeros(3, dtype=int)
for i in range(3):
    r[i] = sum([year**i * total for year,total in ridership_data])
r
coefs = linalg.solve(M,r)
coefs
a,b,c = coefs
F = lambda Y: a + b*Y + c*Y**2
years = [year for year,total in ridership_data]
totals = [total for year,total in ridership_data]
actual, = plt.plot(years,totals,"o", label="Actual ridership")
x = np.linspace(years[0],years[-1])
y = F(x)
least_squares, = plt.plot(x, y, "r", label="least squares approx")
plt.title("Total ridership each year")
plt.xlabel("Year")
plt.ylabel("Annual total rides")
legend = plt.legend(handles=[actual,least_squares], loc='lower left')
plt.show()
The function looks like a good fit, so we might guess that the ridership in $2019$ will be approximately:
F(2019)
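As an optional cross-check (not part of TAK's treatment), numpy's `np.polyfit` computes the same degree $2$ least squares fit directly, returning the coefficients from highest to lowest degree. The result should be close to `coefs`, though perhaps not identical, since the problem is poorly conditioned when raw years are used as inputs.
# Optional cross-check with np.polyfit, which returns [c, b, a] (highest degree first).
c2, b2, a2 = np.polyfit(years, totals, 2)
print(a2, b2, c2)   # compare with coefs = [a, b, c]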
A (real) vector space is a set $V$ which is closed under the operations of vector addition and scalar multiplication by real numbers.
An inner product on $V$ is a map $V \times V \to {\mathbb R}$ satisfying certain properties. The most familiar example is the dot product on ${\mathbb R}^n$. We'll denote a general inner product by $\langle v, w\rangle$. The properties are bilinearity (linearity in each argument) and symmetry ($\langle v, w \rangle = \langle w, v \rangle$ for all $v, w \in V$).
(By bilinearity, it is true that $\langle 0, w \rangle = 0$ for every $w \in V$.) It is said to be positive-definite if $\langle v, v \rangle > 0$ for every non-zero $v \in V$, and positive semi-definite if $\langle v, v \rangle \geq 0$ for every $v \in V$.
A positive-definite inner product naturally leads to a distance function, where the distance from $v$ to $w$ is $$d(v,w) = \sqrt{\langle v-w, v-w \rangle}.$$ The norm of a vector $v$ is its distance from $0$, which is usually denoted $\|v\|$. We have $$\|v\|=\sqrt{\langle v,v\rangle}.$$ In the positive semi-definite case, you can still make this definition but the distance is degenerate in the sense that sometimes $d(v,w)=0$ even if $v \neq w$.
Two vectors $v$ and $w$ are orthogonal if $\langle v, w\rangle=0$.
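For example, using the dot product as the inner product on ${\mathbb R}^3$, we can check orthogonality and compute a norm numerically (a small illustration, not an example from TAK):
# Check orthogonality and compute a norm using the dot product on R^3.
v = np.array([1.0, 2.0, 2.0])
w = np.array([2.0, -1.0, 0.0])
print(v @ w)             # 0.0, so v and w are orthogonal
print(np.sqrt(v @ v))    # the norm ||v|| = 3.0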
Let $W$ be a subset of a vector space $V$. We say $W$ is a subspace if it is closed under addition and scalar multiplication. We say $\langle,\rangle$ is positive-definite on $W$ if $\langle w, w \rangle>0$ for every non-zero $w \in W$.
An example of a subspace of the space of all continuous functions ${\mathbb R} \to {\mathbb R}$ is the set of polynomials of degree less than $d$.
Theorem. If $W$ is finite dimensional and the inner product $\langle,\rangle$ on $V$ is positive-definite on $W$, then for every $v_\ast \in V$ there is a unique $w_\ast \in W$ that minimizes the distance from $v_\ast$. The vector $v_\ast - w_\ast$ is uniquely determined by the condition that it is orthogonal to every vector in $W$.
We call $w_\ast$ the projection of $v_\ast$ onto $W$, and write $w_\ast=proj_W(v_\ast)$. The proof of the theorem is beyond the scope of these notes (and, in this generality, of this class).
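Here is a small numerical illustration of the theorem (my own example, taking $W$ to be a one-dimensional subspace of ${\mathbb R}^2$ with the dot product as the inner product):
# Project v_star onto W = span{w} and check that the residual is orthogonal to W.
w = np.array([1.0, 1.0])
v_star = np.array([3.0, 1.0])
w_star = ((v_star @ w) / (w @ w)) * w    # proj_W(v_star)
print(w_star)                            # [2. 2.]
print((v_star - w_star) @ w)             # 0.0 (up to rounding)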
In least squares, we usually have a collection of inputs $t_0, \ldots, t_{m-1}$ and output measurements, say $M(t_0), \ldots, M(t_{m-1})$. The possible outputs lie in ${\mathbb R}^{m}$, with our data representing the vector $$v_\ast = \left(\begin{matrix} M(t_0)\\ \vdots \\ M(t_{m-1}) \end{matrix}\right) \in {\mathbb R}^m.$$
If we want to approximate by polynomials of degree less than or equal to $d$, then we consider the subspace $$W = \textit{span} \left\{ \left(\begin{matrix} 1 \\ \vdots \\ 1 \end{matrix}\right), \left(\begin{matrix} t_0 \\ \vdots \\ t_{m-1} \end{matrix}\right), \left(\begin{matrix} t_0^2 \\ \vdots \\ t_{m-1}^2 \end{matrix}\right), \ldots, \left(\begin{matrix} t_0^d \\ \vdots \\ t_{m-1}^d \end{matrix}\right)\right\}.$$ Vectors in $W$ represent restrictions of polynomials of degree at most $d$ to the points $t_0, \ldots, t_{m-1}$.
The quantity we want to minimize is $$\sum_{j=0}^{m-1} \big(w_j - M(t_j)\big)^2=d(v_\ast, w)^2,$$ where $w$ ranges over all vectors in $W$. Thus we want the $w \in W$ closest to $v_\ast$, namely $w_\ast=proj_W(v_\ast)$.
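For a concrete picture of this spanning set (a sketch with made-up sample points $t_j$, not data from this section), the vectors above are exactly the columns of a Vandermonde-type matrix:
# Hypothetical sample points and degree, just to display the spanning vectors of W.
t = np.array([0.0, 0.5, 1.0, 1.5])
d = 2
A = np.vander(t, d + 1, increasing=True)   # columns are 1, t, t^2 evaluated at each t_j
print(A)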
I want to repeat our study of the subway ridership example from this higher-level viewpoint. The vector we are trying to approximate with a quadratic polynomial is `totals`:
totals
The entries represent values each year, as given in the `years` vector:
years
We want to find an approximation by a quadratic polynomial of the form $$a + bx + cx^2.$$ But, we only want to compare years where we have data. The following converts a triple of coefficients to the vector of values taken at each year.
def coefficients_to_vector(triple):
    a,b,c = triple
    return_list = []
    for year in years:
        return_list.append(a + b*year + c*year**2)
    return np.array(return_list)
It is important to realize that we are working with two coordinate systems.
The first coordinate system is for working with functions from years to the real numbers. Our original data is presented in this coordinate system. Since there are six years in our data, we think of our original data as giving an element of ${\mathbb R}^6$ with each coordinate representing the number of riders in the corresponding year.
The second coordinate system is given by coefficients of quadratic polynomials. The polynomial $a + bx+c x^2$ will be represented by the triple of coordinates $(a,b,c) \in {\mathbb R}^3$.
We can evaluate a polynomial on a year in our data set. Thus quadratic polynomials have a second coordinate representation in ${\mathbb R}^6$ given by evaluating the polynomial on the years we have data for. The `coefficients_to_vector` function above converts from coefficients to this vector in ${\mathbb R}^6$. The quadratic polynomials thus give a three dimensional subspace $W \subset {\mathbb R}^6$.
The least squares approximation for our data is obtained by orthogonally projecting our data point to $W$. That projection lies in $W$, and to think of it as a quadratic polynomial we want its representation as a triple of coefficients.
The natural basis for the space $W$ of quadratic polynomials is $t \mapsto 1$, $t \mapsto t$, and $t \mapsto t^2$. As coefficient vectors these are given by the following:
w0_coef = np.array([1,0,0])
w1_coef = np.array([0,1,0])
w2_coef = np.array([0,0,1])
Below we convert these to vectors of values in ${\mathbb R}^6$. So `w2_coef` is a polynomial given as coefficients, and `w2_vector` will be the same polynomial viewed as a vector of values in ${\mathbb R}^6$.
w0_vector = coefficients_to_vector(w0_coef)
w1_vector = coefficients_to_vector(w1_coef)
w2_vector = coefficients_to_vector(w2_coef)
print(w0_vector, w1_vector, w2_vector)
Now we use Gram-Schmidt orthogonalization to convert our basis for $W$ into an orthogonal basis. We want to be able to access these basis elements in both coordinate systems. Note that we will only be using the dot product for comparing vectors (not coefficients). This is because we are interested in finding an orthogonal basis in ${\mathbb R}^6$, where we will be doing the projection.
y0_vector = w0_vector
y0_coef = w0_coef
y1_vector = w1_vector - ((w1_vector @ y0_vector)/(y0_vector @ y0_vector)) * y0_vector
y1_coef = w1_coef - ((w1_vector @ y0_vector)/(y0_vector @ y0_vector)) * y0_coef
# Sanity check that y1_vector is orthogonal to y0_vector
y0_vector @ y1_vector
y2_vector = w2_vector - ((w2_vector @ y0_vector)/(y0_vector @ y0_vector)) * y0_vector \
- ((w2_vector @ y1_vector)/(y1_vector @ y1_vector)) * y1_vector
y2_coef = w2_coef - ((w2_vector @ y0_vector)/(y0_vector @ y0_vector)) * y0_coef \
- ((w2_vector @ y1_vector)/(y1_vector @ y1_vector)) * y1_coef
# Sanity check that y2_vector is orthogonal to y0_vector and y1_vector:
print(y0_vector @ y2_vector)
print(y1_vector @ y2_vector)
Now we have an orthogonal basis of quadratic polynomials. We can compute the projection of `totals` onto this space:
proj_vector = ((totals @ y0_vector)/(y0_vector @ y0_vector)) * y0_vector \
+ ((totals @ y1_vector)/(y1_vector @ y1_vector)) * y1_vector \
+ ((totals @ y2_vector)/(y2_vector @ y2_vector)) * y2_vector
The original minus the projection should be orthogonal to the subspace. To check the angle between `totals-proj_vector` and the basis vectors `y0_vector`, `y1_vector`, and `y2_vector`, we can use the dot product. Recall that for two vectors $v$ and $w$, $$v \cdot w = \|v\| \|w\| \cos \theta,$$ where $\theta$ is the angle between them. Thus, we have $$\cos \theta = \frac{v \cdot w}{\|v\| \|w\|}.$$ We can get the length of a vector $v$ using `np.linalg.norm(v)`.
print(((totals-proj_vector) @ y0_vector) / (np.linalg.norm(totals-proj_vector)*np.linalg.norm(y0_vector)))
print(((totals-proj_vector) @ y1_vector) / (np.linalg.norm(totals-proj_vector)*np.linalg.norm(y1_vector)))
print(((totals-proj_vector) @ y2_vector) / (np.linalg.norm(totals-proj_vector)*np.linalg.norm(y2_vector)))
Note that there are some numerical issues: we are not getting exactly zero.
What we really want are the coefficients. We apply the same projection weights to the coefficient vectors.
proj_coef = ((totals @ y0_vector)/(y0_vector @ y0_vector)) * y0_coef \
+ ((totals @ y1_vector)/(y1_vector @ y1_vector)) * y1_coef \
+ ((totals @ y2_vector)/(y2_vector @ y2_vector)) * y2_coef
proj_coef
Note that this recovers essentially the same coefficients as our elementary approach produced before:
coefs
Since they are slightly different, let's plot the fit again with the new coefficients.
a,b,c = proj_coef
F = lambda x: a + b*x + c*x**2
years = [year for year,total in ridership_data]
totals = [total for year,total in ridership_data]
actual, = plt.plot(years,totals,"o", label="Actual ridership")
x = np.linspace(years[0],years[-1])
y = F(x)
least_squares, = plt.plot(x, y, "r", label="least squares approx")
plt.title("Total ridership each year")
plt.xlabel("Year")
plt.ylabel("Annual total rides")
legend = plt.legend(handles=[actual,least_squares], loc='lower left')
plt.show()
We can actually compare our two solutions and see which is better. We'll compute the norm of the difference between `totals` and the vector arising from the coefficients produced by our first computation:
np.linalg.norm(coefficients_to_vector(coefs)-totals)
Here is the same computation for the second way of computing things using the projection.
np.linalg.norm(coefficients_to_vector(proj_coef)-totals)
The second number is a bit smaller, so our second approach is slightly better.
The Legendre polynomials of degree $\leq d$ give an orthogonal basis for the polynomials on $[-1,1]$ of degree $\leq d$, where the inner product is $$\langle f, g\rangle = \int_{-1}^1 f(x)g(x)~dx.$$ The first few are listed in TAK in equation (4.36) on page 123. They are normalized so that $p(1)=1$.
We will use a vector $[a_0, a_1, \ldots, a_d]$ to represent the polynomial $$t \mapsto a_0 + a_1 t + a_2 t^2 + \ldots + a_d t^d.$$
def coef_to_polynomial(coefs):
    def f(t):
        total = 0.0
        for j in range(len(coefs)):
            total += coefs[j] * t**j
        return total
    return f
The product of a polynomial of degree $d_1$ and a polynomial of degree $d_2$ is a polynomial of degree $d_1+d_2$. The following computes the coefficients of products.
def coef_product(coefs1, coefs2):
    return_coefs = np.zeros(len(coefs1)+len(coefs2)-1)
    for i in range(len(coefs1)):
        for j in range(len(coefs2)):
            return_coefs[i+j] += coefs1[i]*coefs2[j]
    return return_coefs
Unfortunately, we also can't just add coefficient vectors directly, because numpy only adds vectors with the same number of entries.
def coef_add(coefs1, coefs2):
    N = max(len(coefs1), len(coefs2))
    return_coefs = np.zeros(N)
    for j in range(len(coefs1)):
        return_coefs[j] += coefs1[j]
    for j in range(len(coefs2)):
        return_coefs[j] += coefs2[j]
    return return_coefs
To test `coef_product`, observe that $(1+2x)(3+4x)=3+10x+8x^2$. We can compute this:
coef_product([1,2],[3,4])
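We can test `coef_add` in the same way: since $(1+2x)+(3+4x+5x^2)=4+6x+5x^2$, we should get the coefficients $[4, 6, 5]$.
coef_add([1,2],[3,4,5])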
The following takes coefficients for a polynomial $p$ and returns $\int_{-1}^1 p(t)~dt$. Note that $$\int_{-1}^1 t^j~dt = \begin{cases} 0 & \text{if $j$ is odd,} \\ \frac{2}{j+1} & \text{if $j$ is even.} \end{cases}$$
def coef_integral(coefs):
    total = 0.0
    for j in range(0, len(coefs), 2):
        total += 2 * coefs[j] / (j+1)
    return total
Here we check our work: $\int_{-1}^1 (1+t+t^2)~dt = 2 + 0 + \frac{2}{3} = \frac{8}{3} \approx 2.667$.
coef_integral([1,1,1])
With the functions above we can define the inner product:
def ip(coefs1, coefs2):
    return coef_integral(coef_product(coefs1, coefs2))
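As a quick sanity check, $\langle 1, t\rangle = \int_{-1}^1 t~dt = 0$ and $\langle t, t\rangle = \int_{-1}^1 t^2~dt = \frac{2}{3}$:
print(ip([1],[0,1]))     # should be 0
print(ip([0,1],[0,1]))   # should be 2/3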
Suppose we want the Legendre polynomials up to degree $D$. Following the book we'll use
D = 4
The usual basis for the polynomials of degree $\leq D$ is given below:
W = []
for j in range(D+1):
    W.append(np.array(j*[0] + [1]))
W
Now we use Gram-Schmidt to produce an orthogonal basis.
Y = [W[0]]
for j in range(1, D+1):
    w = W[j]
    y = w.copy()
    for k in range(j):
        y = coef_add(y, - (ip(w,Y[k])/ip(Y[k],Y[k])) * Y[k])
    Y.append(y)
Y
These polynomials are not yet normalized to satisfy $p(1)=1$. Here we produce a normalized list:
YN = []
for y in Y:
    f = coef_to_polynomial(y)
    YN.append(y / f(1))
YN
We can see that the polynomials whose coefficients are given above are the same as the Legendre polynomials listed in (4.36).
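As an optional cross-check (not in TAK; this assumes numpy's `numpy.polynomial.legendre` module), `leg2poly` converts Legendre-series coefficients into ordinary polynomial coefficients, so each line printed below should match the corresponding entry of `YN` up to rounding:
from numpy.polynomial import legendre as npleg
for d in range(D+1):
    e = np.zeros(d+1)
    e[d] = 1                    # the degree-d Legendre polynomial as a Legendre series
    print(npleg.leg2poly(e))    # power-basis coefficients; compare with YN[d]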
The book has several problems and examples using continuous least squares approximation over intervals. Here you want to find a projection to a space of polynomials using an inner product coming from an integral. Once you have an orthogonal basis, you can use the usual projection formula to compute the least squares approximation.
Let's try to find the best approximation to $\cos(x)$ on the interval $[-1,1]$ by a polynomial of degree at most four. This will be given by $$\sum_{j=0}^4 \frac{\langle y_j, \cos \rangle}{\langle y_j, y_j \rangle} y_j,$$ where $y_0, \ldots, y_4$ represent the Legendre polynomials of degree up to $4$.
We don't currently have the ability to evaluate $$\langle y_j, \cos \rangle= \int_{-1}^1 y_j(t) \cos(t)~dt.$$ Since $y_j$ is a polynomial of degree at most $4$, to figure this out from the coefficients we need to know $$\int_{-1}^1 t^k \cos(t)~dt$$ for $k \in \{0,1,2,3,4\}$. Observe that if $k$ is odd, we are integrating an odd function over the symmetric interval $[-1,1]$, so this evaluates to zero. Now consider the even case. When $k=0$, we have $$\int_{-1}^1 t^0 \cos(t)~dt=\int_{-1}^1 \cos(t)~dt=[\sin t]_{-1}^1 = 2 \sin(1). \tag{1}$$ We can get an inductive formula for the other integrals using integration by parts. If $k \geq 2$ is even, we have $$\int_{-1}^1 t^k \cos(t)~dt=[t^k \sin(t)]_{-1}^1 - \int_{-1}^1 k t^{k-1} \sin(t) ~dt.$$ Applying integration by parts again yields $$\int_{-1}^1 t^k \cos(t)~dt=2 \sin(1) - [-k t^{k-1} \cos(t)]_{-1}^1 + \int_{-1}^1 -k (k-1) t^{k-2} \cos(t)~dt.$$ Simplifying, we see $$\int_{-1}^1 t^k \cos(t)~dt=2 \sin(1) + 2 k \cos(1) - k (k-1) \int_{-1}^1 t^{k-2} \cos(t)~dt. \tag{2}$$ Using (1) and (2) we get an inductive formula for $$z_k = \int_{-1}^1 t^k \cos(t)~dt.$$ We update the value of $z$ as we move through the even entries of the polynomial in the function below.
def integrate_poly_times_cos(coefs):
    # Return the integral from -1 to 1 of the function
    # p(t) * cos(t), where p(t) is the polynomial with
    # the provided coefficients.
    z = 2*m.sin(1)
    total = coefs[0] * z
    for k in range(2, len(coefs), 2):
        # Update the value of z:
        z = 2*m.sin(1) + 2*k*m.cos(1) - k*(k-1)*z
        # Update the total integral
        total += coefs[k] * z
    return total
Here we compute the degree $0$ best approximation, using the list `YN[:1]` of Legendre polynomials up to degree 0:
proj_coefs = np.array([0])
for y in YN[:1]:
    proj_coefs = coef_add( proj_coefs, \
        (integrate_poly_times_cos(y)/ip(y,y)) * y)
proj_coefs
f = coef_to_polynomial(proj_coefs)
x = np.linspace(-1,1)
cos, = plt.plot(x, np.cos(x), label="cos(x)")
approx, = plt.plot(x, f(x), "r", label="approximate")
plt.title("The degree 0 best approximate of cos(x)")
legend = plt.legend(handles=[cos, approx], loc='lower center')
plt.show()
Since a degree zero polynomial is a constant function, we can't expect the approximation to be very close. But $\cos(x)$ looks like a parabola, so degree $2$ should work better.
proj_coefs = np.array([0])
for y in YN[:3]:
    proj_coefs = coef_add( proj_coefs, \
        (integrate_poly_times_cos(y)/ip(y,y)) * y)
proj_coefs
f = coef_to_polynomial(proj_coefs)
x = np.linspace(-1,1)
cos, = plt.plot(x, np.cos(x), label="cos(x)")
approx, = plt.plot(x, f(x), "r", label="approximate")
plt.title("The degree 2 best approximate of cos(x)")
legend = plt.legend(handles=[cos, approx], loc='lower center')
plt.show()
Here we plot the error $\cos(x)-f(x)$.
x = np.linspace(-1,1)
plt.plot(x, np.cos(x) - f(x))
plt.show()
Now we compute the degree four best approximation.
proj_coefs = np.array([0])
for y in YN:
    proj_coefs = coef_add( proj_coefs, \
        (integrate_poly_times_cos(y)/ip(y,y)) * y)
proj_coefs
f = coef_to_polynomial(proj_coefs)
x = np.linspace(-1,1)
cos, = plt.plot(x, np.cos(x), label="cos(x)")
approx, = plt.plot(x, f(x), "r", label="approximate")
plt.title("The degree 4 best approximate of cos(x)")
legend = plt.legend(handles=[cos, approx], loc='lower center')
plt.show()
Now the approximation lies right on top of the graph of $\cos(x)$. We plot the error below.
x = np.linspace(-1,1)
plt.plot(x, np.cos(x) - f(x))
plt.show()
Note that the error is much smaller than for the degree $2$ approximation.
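To quantify this, we can print the maximum absolute error of the degree $4$ approximation on our plotting grid:
print(np.max(np.abs(np.cos(x) - f(x))))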