Andrew Abok

The Levenberg-Marquardt algorithm

Nonlinear Least Squares

Nonlinear least squares regression extends linear least squares regression for use with a much larger and more general class of functions.

A nonlinear model is any model of the form $y = f(x;\beta) + \epsilon$. In this case, $y$ could, for example, take the form $y = \frac{\beta_0 x}{\beta_1 + x}$.

Consider a set of $m$ possibly nonlinear equations in $n$ variables $x = (x_1, \dots, x_n)$, which can be written as $f_i(x) = 0,\; i = 1, \dots, m$. For any $x$, we call $f_i(x)$ the $i$-th residual.

We can seek an approximate solution $\hat{x}$ by finding the $x$ that minimizes the sum of squares of the residuals:

\[f_1(x)^2 + \dots + f_m(x)^2 = \|f(x)\|^2, \qquad \text{with } \|f(x)\|^2 \geq \|f(\hat{x})\|^2 \text{ for all } x\]

Simply put, this translates to

\[\text{minimize } \|f(x)\|^2\]

The optimality condition is that the partial derivatives of $\|f(x)\|^2$ with respect to each of $x_1, \dots, x_n$ must vanish at $\hat{x}$.
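
As a small concrete sketch (the function and parameter names below are made up for illustration), the residuals and the sum-of-squares objective for the example model $y = \frac{\beta_0 x}{\beta_1 + x}$ can be written in F# as:

// residual of one observation under the example model y = b0*x / (b1 + x)
let residual (beta: float[]) (x: float) (y: float) =
    y - beta.[0] * x / (beta.[1] + x)

// sum of squared residuals over a data set: the ||f(x)||^2 we want to minimize
// e.g. sumOfSquares [| 1.0; 1.0 |] [| 0.5; 1.0; 2.0 |] [| 0.3; 0.5; 0.7 |]
let sumOfSquares (beta: float[]) (xs: float[]) (ys: float[]) =
    Array.map2 (fun x y -> residual beta x y ** 2.0) xs ys |> Array.sum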

Gauss–Newton Method

The Gauss-Newton method is a modification of Newton's method (a root-finding algorithm) for minimizing a sum of squares. Instead of computing the full second derivative (Hessian matrix), it builds the search direction from first derivatives and is typically paired with a line search.

In Newton’s method, the update step requires calculating the full Hessian matrix $H$, which involves second-order derivatives of the total error function and can be expensive to compute. The Gauss-Newton method simplifies this by using the Jacobian matrix to approximate the Hessian.

Mathematical Description

Sum of Squared Errors: \(E(x, w) = \frac{1}{2} \sum_{m=1}^{M} e_m^2\) where:

  • $x$ is the input vector
  • $w$ is the weight vector
  • $e_m$ is the error for the $m$-th data point, defined as \(e_m = y_m - f(x_m, w)\)

Gradient of the Error Function: The gradient $g$ is defined as: \(g = \frac{\partial E(x, w)}{\partial w}\)

Substituting the error function into the gradient gives: \(g_i = \frac{\partial E}{\partial w_i} = \frac{\partial \left( \frac{1}{2} \sum_{m=1}^{M} e_m^2 \right)}{\partial w_i} = \sum_{m=1}^{M} e_m \frac{\partial e_m}{\partial w_i}\)

Collecting the partial derivatives $\frac{\partial e_m}{\partial w_i}$ into the Jacobian matrix $J$ (one row per data point, one column per weight), the relationship between the gradient vector and the Jacobian matrix is: \(g = J^T e\)

Hessian Matrix: The Hessian matrix $H$ is given by: \(h_{i,j} = \frac{\partial^2 E}{\partial w_i \partial w_j} = \frac{\partial^2 \left( \frac{1}{2} \sum_{m=1}^{M} e_m^2 \right)}{\partial w_i \partial w_j}\)

The relationship between the Hessian and the Jacobian matrix is approximated as: \(H \approx J^T J\)
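
To see where this approximation comes from, differentiate the gradient once more and drop the term involving the second derivatives of the errors (it is small when the residuals are small or the model is close to linear):

\[\begin{align*}
h_{i,j} = \frac{\partial^2 E}{\partial w_i \partial w_j}
&= \sum_{m=1}^{M} \frac{\partial e_m}{\partial w_i}\frac{\partial e_m}{\partial w_j}
 + \sum_{m=1}^{M} e_m \frac{\partial^2 e_m}{\partial w_i \partial w_j} \\
&\approx \sum_{m=1}^{M} \frac{\partial e_m}{\partial w_i}\frac{\partial e_m}{\partial w_j}
 = (J^T J)_{i,j}
\end{align*}\]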

Gauss-Newton Algorithm: The update rule for the Gauss-Newton algorithm is: \(w_{k+1} = w_{k} - (J^T_k J_k)^{-1} J_k^T e_k\)

Pseudo code

  1. Initialize weights $w_0$.
  2. Compute the Jacobian matrix $J$ and the error vector $e$.
  3. Update the weights using: $w_{k+1} = w_k - (J^T J)^{-1} J^T e$.
  4. Repeat until convergence.

Implementation in F#

For this as well we will do the implementation in F#; the code uses Math.NET Numerics (MathNet.Numerics.LinearAlgebra) for its vector and matrix types.

// Vector<float> and Matrix<float> come from Math.NET Numerics
open MathNet.Numerics.LinearAlgebra

let gaussNewton (r: Vector<float> -> Vector<float>) (J: Vector<float> -> Matrix<float>) (beta0: Vector<float>) (maxIter: int) (tol: float) =
    let rec iterate beta iter =
        if iter > maxIter then beta
        else
            // evaluate the Jacobian and the residual vector at the current estimate
            let Jbeta = J beta
            let rBeta = r beta
            // Gauss-Newton step: delta = -(J^T J)^-1 (J^T r)
            let delta = -(Jbeta.Transpose() * Jbeta).Inverse() * (Jbeta.Transpose() * rBeta)
            let newBeta = beta + delta
            // stop once the step becomes smaller than the tolerance
            if delta.L2Norm() < tol then newBeta
            else iterate newBeta (iter + 1)
    iterate beta0 0

The Levenberg-Marquardt Algorithm

If you take a closer look at the Levenberg-Marquardt method, it combines the advantages of gradient descent (stability) and of the Gauss-Newton method (speed). Instead of using a line search like the Gauss-Newton method, Levenberg-Marquardt uses a trust-region strategy.

Without going further into the details, using a trust region avoids one of the weaknesses of Gauss-Newton, which is its behavior when the Jacobian $J(x)$ is rank-deficient. The Levenberg-Marquardt algorithm avoids this through how the Hessian matrix is approximated: unlike the Gauss-Newton method, it adds a damping term. \(\begin{align*} H &\approx J^TJ + \mu I \\ \text{where}: \\ \mu &\text{ is a positive damping (combination) coefficient} \\ I &\text{ is the identity matrix} \end{align*}\)

This makes the update rule for Levenberg-Marquardt:

\[w_{k+1} = w_k - (J^T_kJ_k + \mu I)^{-1}J_k^T e_k\]
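
The damping coefficient $\mu$ is what lets the method move between the two behaviours described above:

\[\begin{align*}
\mu \to 0 :& \quad \Delta w_k \to -(J_k^T J_k)^{-1} J_k^T e_k \quad \text{(the Gauss-Newton step)} \\
\mu \text{ large} :& \quad \Delta w_k \approx -\tfrac{1}{\mu} J_k^T e_k = -\tfrac{1}{\mu} g \quad \text{(a small gradient-descent step)}
\end{align*}\]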

Pseudo Code

  1. Initialize weights $w_0$, damping parameter $\mu > 0$, and tolerance $\epsilon$.
  2. Repeat until convergence:
    • Step 1: Compute the Jacobian matrix $J_k$ and the error vector $e_k$ at $w_k$.
    • Step 2: Compute the approximate Hessian matrix:
      \(H = J_k^T J_k + \mu I\)
    • Step 3: Compute the update direction:
      \(\Delta w_k = -H^{-1} J_k^T e_k\)
    • Step 4: Update the weights:
      \(w_{k+1} = w_k + \Delta w_k\)
    • Step 5: Evaluate the error function at the new weights $w_{k+1}$:
      • If the error decreases:
        • Accept the new weights $w_{k+1}$.
        • Decrease $\mu$ (e.g., $\mu = \mu / \beta$, where $\beta > 1$).
      • Else:
        • Reject $w_{k+1}$ and increase $\mu$ (e.g., $\mu = \mu \cdot \beta$).
  3. Check for convergence:
    • Stop if $\| \Delta w_k \| < \epsilon$ or the error does not improve significantly.

Implementation in F#

We can proceed with a basic implementation that solves the update step with LU decomposition, purely for the purpose of understanding. For real use I'd recommend the FSharp.Stats library for nonlinear fitting, given its robustness.
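
The listing below calls a helper `solveWithLU` that isn't shown here. A minimal version, assuming Math.NET Numerics is available (the same library used for the Gauss-Newton code above), could look like this:

open MathNet.Numerics.LinearAlgebra

// solve the linear system H * x = b via Math.NET's LU factorization;
// H is a square matrix given as a jagged array of rows
let solveWithLU (H: float[][]) (b: float[]) : float[] =
    let n = b.Length
    let A = Matrix<float>.Build.DenseOfArray(Array2D.init n n (fun i j -> H.[i].[j]))
    let rhs = Vector<float>.Build.DenseOfArray(b)
    A.LU().Solve(rhs).ToArray()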

let levenbergMarquardt jacobian func (xData: float[][]) (yData: float[]) (initialParameters: float[]) (maxIter: int) (lambda: float) =
    let mutable parameters = initialParameters
    let mutable lambda = lambda
    let nParams = initialParameters.Length

    // sum of squared residuals for a given parameter vector
    let sumSquaredError (p: float[]) =
        Array.sumBy (fun i -> (yData.[i] - func xData.[i] p) ** 2.0) [| 0 .. xData.Length - 1 |]

    for _ in 1 .. maxIter do
        let J = Array.init xData.Length (fun i -> jacobian xData.[i] parameters)
        let r = Array.init xData.Length (fun i -> yData.[i] - func xData.[i] parameters)

        // build J^T J and J^T r
        let JT = Array.transpose J
        let JTJ = Array.init nParams (fun i ->
            Array.init nParams (fun j -> Array.sumBy (fun k -> JT.[i].[k] * J.[k].[j]) [| 0 .. J.Length - 1 |]))
        let JTr = Array.init nParams (fun i -> Array.sumBy (fun k -> JT.[i].[k] * r.[k]) [| 0 .. J.Length - 1 |])

        // modify the approximate Hessian with the Levenberg damping term
        let H = Array.init nParams (fun i ->
            Array.init nParams (fun j -> if i = j then JTJ.[i].[j] + lambda else JTJ.[i].[j]))

        // solve the linear system H * deltaW = J^T r with LU decomposition
        let deltaW = solveWithLU H JTr

        // lambda tuning: accept the step only if it reduces the error,
        // otherwise reject it and increase the damping
        let oldError = sumSquaredError parameters
        let candidate = Array.map2 (+) parameters deltaW
        let newError = sumSquaredError candidate

        if newError < oldError then
            parameters <- candidate
            lambda <- lambda / 10.0
        else
            lambda <- lambda * 10.0

    parameters

To verify that our implementation works, we fit a quadratic function to sample data points.

// quadratic model: f(x; w) = w0*x^2 + w1*x + w2
let modelObjfunc (x: float[]) (parameters: float[]) : float =
    parameters.[0] * x.[0] ** 2.0 + parameters.[1] * x.[0] + parameters.[2]

// partial derivatives of the model with respect to each parameter
let jacobian (x: float[]) (parameters: float[]) =
    [| x.[0] ** 2.0; x.[0]; 1.0 |]

let xData = [| [| 1.0 |]; [| 2.0 |]; [| 3.0 |]; [| 4.0 |] |]
let yData = [| 2.2; 2.8; 3.6; 4.5 |]
let initialParameters = [| 0.5; 0.5; 0.5 |]
let maxIterations = 500
let lambda = 0.01

let optimizedParams = levenbergMarquardt jacobian modelObjfunc xData yData initialParameters maxIterations lambda

The results from the above are: `[|0.075; 0.395; 1.725|]`. After running the optimization, we can plot the fitted curve using Plotly.NET.
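
One possible way to do that (the plotting range and the particular chart calls below are just a sketch, not the exact code from this post):

open Plotly.NET

// evaluate the fitted quadratic on a fine grid so the curve looks smooth
let gridXs = [ 0.5 .. 0.05 .. 4.5 ]
let gridYs = gridXs |> List.map (fun x -> modelObjfunc [| x |] optimizedParams)

let dataPoints = Chart.Point(xData |> Array.map (fun x -> x.[0]), yData)
let fittedCurve = Chart.Line(gridXs, gridYs)

Chart.combine [ dataPoints; fittedCurve ]
|> Chart.withTitle "Levenberg-Marquardt quadratic fit"
|> Chart.show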

Conclusions.

The Levenberg-Marquardt algorithm is best applied in curve fitting. It blends the best of gradient descent (which is great for slow, steady learning) and Gauss-Newton (which is faster but less stable). By adapting its approach based on how close it is to the best solution, it improves accuracy and speed. However, like many optimization techniques, it usually finds the closest good solution (a local minimum) rather than the absolute best one (the global minimum). Another challenge is that when the residuals are large, it converges more slowly or fails to converge at all.
