A Proof for the Method of Least Squares


Using the Method of Least Squares to Arrive at a Best-Fit Approximation for a Full Rank, Overdetermined System of Equations, Matrix A.

The following question was posed in my Linear Algebra (MAT 343) course:

It is a fact, which does not need to be explained, that for any overdetermined matrix A, the nullspace of A and the nullspace of ATA are the same. Based on the equality of the nullspaces of A and ATA, explain why an overdetermined system Ax=b has a unique least squares solution if A is full rank.

This is both an interesting and important question: in mathematics, systems of equations, frequently condensed into matrix form for ease in calculations, allow us to solve complex problems with multiple variables.  Typically, we are able to use matrix algebra to solve such systems, generally because the number of equations also equals the number of unknown variables.  However, if the system of equations is overdetermined, there are actually more equations than unknowns.  This is problematic because the system almost always lacks a solution now (i.e., it is said to be inconsistent), particularly if constructed with random coefficients.

Despite this problem, however, we can still approximate a very good solution by using the “least squares” method, so long as our Matrix A is full rank.  Incidentally, the method of least squares also happens to be a standard approximation approach of regression analysis.  Regression analysis is the statistical process used to estimate relationships between variables, including various modeling techniques for systems involving several variables.  Regression analysis is particularly useful in situations where an important relation exists between a dependent variable and one or more independent variables, and the method of least squares is commonly employed to analyze and solve such problems.

To use this method of least squares, we look for the solution x with the smallest error vector Ax−b, using some vector norm to determine the size of the error vector. In the least squares method, specifically, we look for the error vector with the smallest 2-norm (the “norm” being the size or magnitude of the vector).  While not perfect, the least squares solution does indeed provide a best-fit approximation where no other solution would ordinarily be possible.

But how can we prove that the method of least squares is valid?

In the following proof, we will show that the method of least squares is indeed a valid method that can be used to arrive at a reliable approximation of the solution if our system of equations, or matrix, is full rank; i.e., if all rows and columns of a square matrix are linearly independent (i.e., no vector in the set can be written as a linear combination of another), or, for a non-square matrix, if a maximum number of linearly independent column vectors exist or a maximum number of linearly independent row vectors exist.  Before beginning, we are also assured that the nullspaces of A, and ATA (which is symmetric to A), are the same (e.g., for matrix A, the nullspace is simply the set of all vectors v such that A⋅v=0).  Although the overdetermined system may have more equations than we need, the fact that the equations possess linear independence and a nullspace property will make it possible to arrive at a unique, best-fit approximation.


The proof devised is as follows: Assume that for Ax=b, there is no solution, and that Ax=b is overdetermined.  Though an exact solution does not exist, we can arrive at an approximation or “best fit” where Ax* is as close as possible to b (i.e., we can minimize the length of b – Ax*).  In this case, the “least squares” estimate is x* that would minimize this to approximately Ax=b.

We know that the closest vector to b is the projection of b onto the column space of A.  So, to minimize A*, where A* equals the projection of b onto the column space of A, Ax will need to be equal to the projection of b.  Put another way,

Ax* – b = projc(A)b – b,

where projc(A)b – b is “a,” and a is orthogonal to the column space of A. Thus,

Ax* – b ∈ c(A) .

We know that the orthogonal complement is the nullspace of AT, so

Ax* – b ∈ N(AT)

= AT(Ax*- b) = 0 (nullspace)

= ATAx*- ATb = 0

= ATAx* = ATb

Because of this, a unique “least squares” approximation exists for Ax=b.

Thus, when solving an overdetermined m x n system Ax = b, using least squares, we can use the equation (ATA)x = ATb.

Finally, if the rank of A is n, then ATA is invertible, and we can multiply through the normal equation by (ATA)-1 to obtain

x = (ATA)-1ATb

where the matrix (ATA)-1AT is the pseudoinverse of matrix A.


I hope that you enjoyed this proof and that it provides every confidence to use the method of least squares when confronted with a full-rank, overdetermined system.



  1. Leon, Linear Algebra with Applications (9th ed.), Pearson Education (2015).



Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s