Solving Least Squares Problems
Introduction
We are back at it with linear algebra. In data science, engineering, and applied mathematics, least squares problems are fundamental. Whether you're fitting a regression model, processing signals, or solving systems of linear equations, the least squares method often arises when an exact solution is impossible or impractical.
The linear least squares problem seeks the vector $x$ that minimizes the residual norm:

$$\min_x \|Ax - b\|_2$$

Where:

- $A$ is an $m \times n$ matrix of coefficients,
- $b$ is an $m$-vector of observations,
- $x$ is the vector of unknowns we wish to find.

Depending on the relationship between $m$ and $n$, the system may be overdetermined ($m > n$), square ($m = n$), or underdetermined ($m < n$); least squares most commonly arises in the overdetermined case.
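For example, fitting a straight line $y \approx \beta_0 + \beta_1 t$ to $m$ data points $(t_i, y_i)$ is exactly this problem, with:

$$A = \begin{bmatrix} 1 & t_1 \\ \vdots & \vdots \\ 1 & t_m \end{bmatrix}, \qquad x = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \qquad b = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$$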
The standard approaches to solving the least squares problem are:

- Normal equations
- QR decomposition
- Singular value decomposition (SVD)
- Transformation to an equivalent linear system
There is no single best method in practice: the right choice depends on the size and shape of the matrix, performance requirements, and so on. This is the reason I ended up writing this post; we were trying to determine which method should be the default that gets called whenever a request is made.
1. Normal Equations
This is the classical method for solving least squares problems, introduced by Gauss. It solves the problem by transforming it into the form:

$$A^TAx = A^Tb$$

If the matrix $A$ has full column rank (its columns are linearly independent), then $A^TA$ is symmetric positive definite and the system has a unique solution. This sequence of transformations can be thought of as:

$$Ax = b \;\rightarrow\; A^TAx = A^Tb \;\rightarrow\; x = (A^TA)^{-1}A^Tb$$
Stability Considerations
While the normal equations method is exact in theory, in practice it can suffer from numerical instability. This is because forming $A^TA$ explicitly can lose information to rounding error. Moreover, the condition number of $A^TA$ is the square of that of $A$:

$$\kappa(A^TA) = \kappa(A)^2$$

This squaring effect can significantly worsen stability, making the normal equations unsuitable for poorly conditioned matrices.
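A quick numerical check of this squaring effect, as an illustrative sketch using Math.NET's `ConditionNumber()` (the exact values depend on the random matrix generated):

open MathNet.Numerics.LinearAlgebra

// Illustrative check: cond(A^T A) = cond(A)^2 for any full-rank A.
let demonstrateConditionSquaring () =
    let A = Matrix<float>.Build.Random(100, 10)
    let condA = A.ConditionNumber()
    let condAtA = (A.Transpose() * A).ConditionNumber()
    printfn "cond(A)     = %e" condA
    printfn "cond(A^T A) = %e (cond(A)^2 = %e)" condAtA (condA * condA)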
open MathNet.Numerics.LinearAlgebra

// isPositiveDefinite is not defined in the original snippet;
// one possible implementation checks the eigenvalues of the symmetric matrix.
let isPositiveDefinite (M: Matrix<float>) =
    M.Evd().EigenValues |> Seq.forall (fun ev -> ev.Real > 0.0)

let normalEquationsCholeskySafe (A: Matrix<float>) (b: Vector<float>) =
    let At = A.Transpose()
    let AtA = At * A
    let Atb = At * b
    if isPositiveDefinite AtA then
        // Cholesky: AtA = L L^T, then two triangular solves.
        let cholesky = AtA.Cholesky()
        cholesky.Solve Atb
    else
        // Fall back to a general (LU-based) solve.
        printfn "Matrix is NOT positive definite."
        AtA.Solve Atb
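A usage sketch with made-up numbers (the `matrix` and `vector` builders come from the MathNet.Numerics.FSharp package):

// Hypothetical data: five equations, three unknowns.
let A = matrix [ [ 1.0; 0.0; 1.0 ]
                 [ 2.0; 1.0; 0.0 ]
                 [ 0.0; 1.0; 1.0 ]
                 [ 1.0; 1.0; 1.0 ]
                 [ 3.0; 0.0; 2.0 ] ]
let b = vector [ 1.0; 2.0; 3.0; 4.0; 5.0 ]
normalEquationsCholeskySafe A b |> printfn "x = %A"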
2. QR Decomposition
QR decomposition is a factorization method for an $m \times n$ matrix $A$ (with $m \ge n$) of the form:

$$A = QR$$

where:

- $Q$ is an orthogonal matrix ($Q^TQ = I$),
- $R$ is an upper triangular matrix.

QR decomposition simplifies solving linear least squares problems mainly because it transforms the problem into solving a triangular system.
Construction of QR Decomposition
The core idea in QR revolves around orthogonalizing the columns of $A$, typically via Gram-Schmidt, Givens rotations, or Householder reflections. Rather than diving into full derivations of the Householder transformation (which is widely used in practice given its stability), you can refer to this detailed lecture note for the mathematics.
In simple terms, Householder reflectors use reflection to zero out the lower elements of a column. Conceptually, you transform a column vector $a_k$ into a multiple of $e_1$ by applying the reflector:

$$H = I - 2\frac{vv^T}{v^Tv}, \quad \text{where } v = a_k \pm \|a_k\|_2 e_1$$

Applying these reflectors sequentially, column by column, yields $R$, while the product of the reflectors gives $Q$.
An implementation in F# would look like this:
let householderQR (A: Matrix<float>) =
    let m = A.RowCount
    let n = A.ColumnCount
    let tau = DenseVector.zero n
    let mutable A = A.Clone()
    for j in 0 .. n - 1 do
        // Column to be zeroed below the diagonal.
        let x = A.SubMatrix(j, m - j, j, 1).Column(0)
        let normx = x.L2Norm()
        // Choose the sign that avoids cancellation.
        let s = if x.[0] >= 0.0 then -1.0 else 1.0
        let u1 = x.[0] - s * normx
        let w = x / u1
        w.[0] <- 1.0
        // Store the reflector below the diagonal, R's entry on it.
        for i in 1 .. w.Count - 1 do
            A.[j + i, j] <- w.[i]
        A.[j, j] <- s * normx
        tau.[j] <- -s * u1 / normx
        // Apply the reflection to the remaining columns.
        if j < n - 1 then
            let subA = A.SubMatrix(j, m - j, j + 1, n - j - 1)
            let wT_A = w.ToRowMatrix() * subA
            let update = tau.[j] * w.ToColumnMatrix() * wT_A
            A.SetSubMatrix(j, m - j, j + 1, n - j - 1, subA - update)
    A, tau
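As a sanity check (my own test, not from the original post), the $R$ factor extracted from the packed output can be compared against Math.NET's built-in QR; the two agree up to row signs:

let Atest = Matrix<float>.Build.Random(6, 3)
let packed, _ = householderQR Atest
// R lives on and above the diagonal of the packed result.
let R = packed.SubMatrix(0, 3, 0, 3).UpperTriangle()
let Rbuiltin = Atest.QR().R.SubMatrix(0, 3, 0, 3)
// Compare magnitudes, since Householder QR fixes R only up to row signs.
(R.PointwiseAbs() - Rbuiltin.PointwiseAbs()).Enumerate()
|> Seq.map abs
|> Seq.max
|> printfn "max |R| discrepancy: %e"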
Least Squares Using QR
Multiplication by an orthogonal matrix preserves the Euclidean norm, which is what makes the QR approach numerically stable. Once $A = QR$ is computed, we can write:

$$\|Ax - b\|_2 = \|Q^T(Ax - b)\|_2 = \|Rx - Q^Tb\|_2$$

Since $Q$ is orthogonal, the minimizer is unchanged by the transformation, which leads to solving the triangular system:

$$Rx = Q^Tb$$

where only the first $n$ rows of $R$ and $Q^Tb$ participate.
// Apply Q^T to the vector b using the stored reflectors.
let applyQTransposeToB (A: Matrix<float>) (tau: Vector<float>) (b: Vector<float>) : Vector<float> =
    let m = A.RowCount
    let n = tau.Count
    let bCopy = b.Clone()
    for j in 0 .. n - 1 do
        // Rebuild the j-th reflector from the subdiagonal of A.
        let w = DenseVector.init (m - j) (fun i ->
            if i = 0 then 1.0 else A.[j + i, j])
        let dot = w.DotProduct(bCopy.SubVector(j, m - j))
        let scale = tau.[j] * dot
        for i in 0 .. m - j - 1 do
            bCopy.[j + i] <- bCopy.[j + i] - scale * w.[i]
    bCopy
This can be efficiently solved using back substitution.
// Finally, back-substitute to solve Rx = Q^T b.
let backSubstitute (R: Matrix<float>) (b: Vector<float>) : Vector<float> =
    let n = R.ColumnCount
    let x = DenseVector.zero n
    for i in n - 1 .. -1 .. 0 do
        let sum =
            [ i + 1 .. n - 1 ]
            |> List.sumBy (fun j -> R.[i, j] * x.[j])
        x.[i] <- (b.[i] - sum) / R.[i, i]
    x
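Putting the three pieces together gives a complete QR-based least squares solver (this composition is mine; the post presents the pieces separately):

// Compose: factor A, apply Q^T to b, back-substitute on the top n x n block.
let solveLeastSquaresQR (A: Matrix<float>) (b: Vector<float>) : Vector<float> =
    let n = A.ColumnCount
    let packed, tau = householderQR A
    let qtb = applyQTransposeToB packed tau b
    let R = packed.SubMatrix(0, n, 0, n)  // backSubstitute only reads the upper triangle
    backSubstitute R (qtb.SubVector(0, n))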
3. Singular Value Decomposition
The SVD originates from the concepts of singular values and eigendecomposition. The singular value decomposition has the form:

$$A = U\Sigma V^T$$

where $U$ ($m \times m$) and $V$ ($n \times n$) are orthogonal, and $\Sigma$ ($m \times n$) is diagonal with the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ on its diagonal.
3.1 Geometric Interpretation
SVD can be understood geometrically as:

- Rotation/reflection by $V^T$: reorients the coordinate system in the domain.
- Scaling by $\Sigma$: stretches or compresses along the new coordinate axes.
- Rotation/reflection by $U$: reorients the result in the range.

The range space and null space are directly revealed by the SVD:

- Columns of $V$ corresponding to zero singular values form an orthonormal basis for the null space of $A$.
- Remaining columns of $V$ span the orthogonal complement of the null space.
- Columns of $U$ corresponding to nonzero singular values form an orthonormal basis for the range of $A$.
- Remaining columns of $U$ span the orthogonal complement of the range.
Ordinary Least Squares

For least squares, the SVD yields the pseudoinverse solution:

$$x = A^+b = V\Sigma^+U^Tb$$

where $\Sigma^+$ inverts the nonzero singular values and zeroes the rest. This handles rank-deficient problems gracefully and returns the minimum-norm solution.

In ordinary least squares, we assume the errors live only in the observations $b$. In Total Least Squares (TLS), both $A$ and $b$ are treated as noisy. This is solved by computing the SVD of the augmented matrix $[A \; b]$ and taking the right singular vector associated with its smallest singular value. If that vector is $v = \begin{bmatrix} v_{1:n} \\ v_{n+1} \end{bmatrix}$ with $v_{n+1} \neq 0$, the TLS solution is $x_{TLS} = -v_{1:n}/v_{n+1}$.
Here is an F# implementation:
let solveLeastSquaresSVD (A: Matrix<float>) (b: Vector<float>) : Vector<float> =
    let svd = A.Svd true
    let s = svd.S.ToArray()
    // Treat singular values below this threshold as zero.
    let tolerance = 1e-10 * float (max A.RowCount A.ColumnCount) * s.[0]
    let rank = s |> Array.filter (fun x -> x > tolerance) |> Array.length
    // Sigma^+ : invert only the significant singular values.
    let sigmaInv = DenseMatrix.init rank rank (fun i j ->
        if i = j && s.[i] > tolerance then 1.0 / s.[i] else 0.0)
    let U = svd.U.SubMatrix(0, A.RowCount, 0, rank)
    let Vt = svd.VT.SubMatrix(0, rank, 0, A.ColumnCount)
    let pseudoInverse = Vt.Transpose() * sigmaInv * U.Transpose()
    pseudoInverse * b
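For completeness, here is a sketch of the Total Least Squares construction described above, via the SVD of the augmented matrix $[A \; b]$ (my own illustration; it assumes the last component of the relevant singular vector is nonzero):

// TLS sketch: the solution comes from the right singular vector
// associated with the smallest singular value of [A b].
let solveTotalLeastSquares (A: Matrix<float>) (b: Vector<float>) : Vector<float> =
    let n = A.ColumnCount
    let augmented = A.Append(b.ToColumnMatrix())  // [A b], size m x (n+1)
    let svd = augmented.Svd true
    let v = svd.VT.Row(n)                         // last row of V^T = smallest-sigma vector
    let alpha = v.[n]
    if abs alpha < 1e-12 then failwith "TLS solution does not exist"
    -(v.SubVector(0, n)) / alpha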
Conclusions
We ran a benchmark to compare the performance of all the methods built above, measuring the average time taken by each method over repeated runs.
open System.Diagnostics

let benchmark name n (f: unit -> Vector<float>) =
    let sw = Stopwatch()
    let times = ResizeArray<float>()
    // Keep the last result so the solver call isn't optimized away.
    let mutable last = DenseVector.zero 1
    for _ in 1 .. n do
        sw.Restart()
        last <- f()
        sw.Stop()
        times.Add(float sw.Elapsed.TotalMilliseconds)
    let avgTime = Seq.average times
    printfn "%s ran %d times. Avg time: %.3f ms" name n avgTime
    name, times |> List.ofSeq
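A hypothetical harness wiring the solvers into the benchmark (the problem size and run count are arbitrary choices for illustration); this produces the `results` list consumed by the chart below:

let m, n, runs = 500, 50, 20
let A = Matrix<float>.Build.Random(m, n)
let b = Vector<float>.Build.Random(m)

let results =
    [ benchmark "Normal equations (Cholesky)" runs (fun () -> normalEquationsCholeskySafe A b)
      benchmark "Householder QR"              runs (fun () -> solveLeastSquaresQR A b)
      benchmark "SVD pseudoinverse"           runs (fun () -> solveLeastSquaresSVD A b) ]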
Finally, we used the Plotly.NET library for visualization.
open Plotly.NET

let chart =
    results
    |> List.map (fun (name, times) ->
        let indexedTimes = times |> List.mapi (fun i t -> i, t)
        Chart.Line(indexedTimes, Name = name))
    |> Chart.combine
    |> Chart.withTitle "Solver Benchmark Times (ms)"
    |> Chart.withXAxisStyle (TitleText = "Run Index")
    |> Chart.withYAxisStyle (TitleText = "Time per Run (ms)")

chart |> Chart.show
The full code can be found here.
On running the above solvers, a clear trade-off emerges between computational efficiency and numerical stability. The normal equations with Cholesky factorization are by far the fastest, but their reliability strongly depends on the conditioning of $A$. When $A$ is ill-conditioned, the squared condition number of $A^TA$ amplifies rounding errors and the computed solution can be badly wrong.
The QR decomposition provides a more balanced compromise. Our custom Householder implementation validates the theoretical steps of the algorithm, while Math.NET's optimized QR consistently outperforms it in practice. QR is thus the most practically robust solver for large, well-posed problems, combining stability with speed.
The singular value decomposition (SVD) stands apart as the most reliable method, as it explicitly handles rank deficiency and ill-conditioning. Its downside is computational cost, which our benchmarks clearly reveal. Nevertheless, SVD is indispensable when accuracy is paramount, or when the system is ill-conditioned. This is especially evident in the context of Total Least Squares, where the smallest singular value encodes the perturbation to both $A$ and $b$, yielding the corrected TLS solution.
There are numerous texts that explore these methods for solving linear least squares, in case you want to dive deeper. In practice, one must therefore match the solver to the problem:
- Cholesky for speed when $A$ is well-conditioned,
- QR as the general-purpose workhorse,
- SVD for ill-conditioned or rank-deficient systems, and
- TLS when modeling noise in both $A$ and $b$.