Understanding BFGS Optimization with F# Implementation
Application of the BFGS method to the Nelson-Siegel-Svensson Model for Kenyan Yield Curve Fitting
1.) Introduction.
Optimization methods have always been one of my favorite topics, given their central role in solving complex problems involving probability. Many problems, especially those involving parameter estimation, rely heavily on optimization techniques. These methods help us estimate the parameters that are essential for building models and, in turn, for interpreting real-world data. For example, in interpolation models such as the Nelson-Siegel-Svensson (NSS) framework, optimization algorithms are used to estimate the parameters that define the curve. Once these parameters are determined, they can be used to generate smooth yield curves for bond markets.
So, what is an optimization algorithm, and why is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm a particularly powerful method? In this article, we’ll explore the BFGS algorithm, its relevance in solving nonlinear optimization problems, and how it can be implemented using F#.
In general, an optimization algorithm iteratively refines a candidate design point, improving the value of a function that needs to be minimized or maximized, until a stopping criterion is met. This criterion could be reaching the best possible solution or exhausting the budgeted time or computational resources.
To better illustrate, let’s define a simple optimization problem:
\[\underset{x}{\text{minimize}} \ f(x) \quad \text{subject to} \quad x \in X\]where $x$ is the design point, a vector whose elements can be adjusted to minimize the objective, $X$ is the set of feasible design points, and $f(x)$ is the objective function. The goal is to find the value of $x$ that minimizes $f(x)$, i.e., the optimal solution that satisfies the given constraints.
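As a toy illustration (unrelated to the yield-curve problem that follows), an objective function and a design point can be written directly in F#:
// A toy objective: f(x) = (x1 - 1)^2 + (x2 - 2)^2, minimized at x = [| 1.0; 2.0 |].
let f (x: float[]) =
    pown (x.[0] - 1.0) 2 + pown (x.[1] - 2.0) 2

// A candidate design point; an optimizer adjusts these elements to reduce f x.
let x = [| 0.0; 0.0 |]
printfn "f(x) = %f" (f x)   // prints 5.000000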
2.) Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method.
The BFGS method is a quasi-Newton method for solving unconstrained nonlinear optimization problems (Dai, 2002). While it is generally effective for convex functions, Dai (2002) demonstrated that BFGS with Wolfe line searches may fail to converge for nonconvex functions.
The Wolfe Conditions Line search
Line search determines the appropriate step length that is used to obtain the next design point in the search direction. While selecting a step size is crucial, spending excessive time on this process can negatively impact the convergence rate and overall efficiency of the optimization algorithm. This underscores the importance of an effective line search strategy. Ideally, one would aim for a global minimizer; however, identifying this value is generally too costly in terms of computation.
The Wolfe conditions play a crucial role in optimization algorithms, ensuring convergence to stationary points. They maintain the positive definiteness of the reduced Hessian approximations (Gilbert, 1997). These conditions ensure sufficient decrease and curvature in the objective function $f$.
Mathematically, the two conditions (sufficient decrease and curvature) are represented respectively as:
\[f(x_k + \alpha_k p_k) \leq f(x_k) + c_1\alpha_k \nabla f_k^T p_k\] \[\nabla f(x_k + \alpha_k p_k)^T p_k \geq c_2\nabla f_k^T p_k\]with $0 < c_1 < c_2 < 1$. The step length, denoted by $\alpha_k$, must satisfy both conditions for it to be accepted and for the iteration to move on to the next point.
There is more to the Wolfe conditions than is captured here; Nocedal and Wright's Numerical Optimization covers them in considerable detail.
Consider an optimization problem involving a line search for the function $f(x_1, x_2, x_3) = \sin(x_1 x_2) + \exp(x_2 + x_3) - x_3$, with the initial point $x = [1, 2, 3]$ and search direction $d = [0, -1, -1]$.
The objective is to find the optimal step size $\alpha$ that minimizes the function along this search direction, subject to the Wolfe conditions.
The line search optimization problem can be expressed as:
\[\min_{\alpha} \left[ \sin((1 + 0\alpha)(2 - \alpha)) + \exp((2 - \alpha) + (3 - \alpha)) - (3 - \alpha) \right]\]Simplifying the expression:
\[\sin(2 - \alpha) + \exp(5 - 2\alpha) + \alpha - 3\]Solving this, we find that the optimal step size $\alpha$ is approximately $3.127$, which updates the new point to:
\[x_{\text{new}} = [1, -1.126, -0.126]\]You can verify that the $\alpha$ we obtain here meets the Wolfe conditions.
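The value $\alpha \approx 3.127$ can be confirmed numerically. Here is a minimal sketch using a coarse grid search over the simplified one-dimensional function above (purely illustrative, not how the line search in the implementation works):
// g(alpha) = f(x + alpha * d) along d = [0; -1; -1] from x = [1; 2; 3]
let g alpha = sin (2.0 - alpha) + exp (5.0 - 2.0 * alpha) + alpha - 3.0

// Coarse grid search for the minimizer of g on [0, 5]
let alphaStar = [ 0.0 .. 0.001 .. 5.0 ] |> List.minBy g
printfn "alpha = %.3f" alphaStar   // approximately 3.127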
It's important to recall the sufficient decrease condition, as we will verify it below.
We begin by calculating the gradient (partial derivatives):
\[\frac{\partial f}{\partial x_1} = x_2 \cos(x_1 x_2)\] \[\frac{\partial f}{\partial x_2} = x_1 \cos(x_1 x_2) + \exp(x_2 + x_3)\] \[\frac{\partial f}{\partial x_3} = \exp(x_2 + x_3) - 1\]At the initial point $x = [1, 2, 3]$, the gradient is:
\[\nabla f(x) = \left[ 2 \cos(2), \cos(2) + \exp(5), \exp(5) - 1 \right]\]We can simplify this to: \(\nabla f(x) \approx \left[ -0.832, 147.997, 147.413 \right]\)
Compute the Objective Function Values:
- $f(x) = \sin(1 \times 2) + \exp(2 + 3) - 3 = \sin(2) + \exp(5) - 3 \approx 0.909 + 148.41 - 3 \approx 146.32$
Now, for the new point after taking the step:
- $x_{\text{new}} = [1, -1.126, -0.126]$
We calculate the new function value:
- $f(x_{\text{new}}) = \sin(1 \times -1.126) + \exp(-1.126 - 0.126) - (-0.126)$
This simplifies to:
- $f(x_{\text{new}}) \approx \sin(-1.126) + \exp(-1.252) + 0.126 \approx -0.902 + 0.285 + 0.126 = -0.491$
Check the Sufficient Decrease Condition:
\(f(x_{\text{new}}) \leq f(x) + c_1 \alpha_k \nabla f(x)^T p_k\)
Assuming a typical value of $c_1 = 0.1$ and using the search direction $p_k = [0, -1, -1]$, we calculate $\nabla f(x)^T p_k$:
\[\nabla f(x)^T p_k = -0.832 \times 0 + 147.997 \times (-1) + 147.413 \times (-1) = -147.997 - 147.413 = -295.41\]Now, checking the condition: \(-0.491 \leq 146.32 + 0.1 \times 3.127 \times (-295.41)\)
\[-0.491 \leq 146.32 - 92.38 = 53.94\]The sufficient decrease condition is satisfied since $-0.491 \leq 53.94$.
The sufficient decrease Wolfe condition is therefore satisfied for $\alpha = 3.127$; the curvature condition can be checked in the same way.
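The same verification can be scripted. Below is a small sketch that re-implements the function and gradient from this example and checks both Wolfe conditions at $\alpha = 3.127$ (the helper names are only for this illustration; $c_1 = 0.1$ and $c_2 = 0.9$ are example values):
// f(x1, x2, x3) = sin(x1*x2) + exp(x2 + x3) - x3 and its gradient
let f (x: float[]) = sin (x.[0] * x.[1]) + exp (x.[1] + x.[2]) - x.[2]
let grad (x: float[]) =
    [| x.[1] * cos (x.[0] * x.[1])
       x.[0] * cos (x.[0] * x.[1]) + exp (x.[1] + x.[2])
       exp (x.[1] + x.[2]) - 1.0 |]
let dot (a: float[]) (b: float[]) = Array.map2 (*) a b |> Array.sum

let x = [| 1.0; 2.0; 3.0 |]
let p = [| 0.0; -1.0; -1.0 |]
let alpha, c1, c2 = 3.127, 0.1, 0.9
let xNew = Array.map2 (fun xi pi -> xi + alpha * pi) x p

// Sufficient decrease: f(x + alpha*p) <= f(x) + c1 * alpha * grad(x)^T p
let sufficientDecrease = f xNew <= f x + c1 * alpha * dot (grad x) p
// Curvature: grad(x + alpha*p)^T p >= c2 * grad(x)^T p
let curvature = dot (grad xNew) p >= c2 * dot (grad x) p
printfn "sufficient decrease: %b, curvature: %b" sufficientDecrease curvature   // both true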
BFGS Algorithm
Given a starting point $x_0$, a convergence tolerance $\epsilon > 0$, and an initial inverse Hessian approximation $H_0$:
$k \leftarrow 0$;
While $\|\nabla f_k\| > \epsilon$:
- Compute the search direction $p_k = -H_k \nabla f_k$.
- Set $x_{k+1} = x_k + \alpha_k p_k$, where $\alpha_k$ is computed using a line search that satisfies the Wolfe conditions.
- Define $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$.
- Compute $H_{k+1}$ by means of the update rule given below.
- $k \leftarrow k + 1$
End (while)
Hessian Matrix Update Rule
A Hessian matrix is used in quasi-Newton optimization to capture the curvature of the function, that is, how quickly the gradient varies. It contains all the second-order partial derivatives, and these entries ensure the approximation has the same curvature as the original function.
In any optimization problem, the first thing we need is the derivative (or partial derivatives), as this dictates the direction the optimization will take.
Example:
Assume we have a two-dimensional function $f : \mathbb{R}^2 \rightarrow \mathbb{R}$; the gradient has the same dimension as the function's input.
\[\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix}\]The Hessian matrix, however, has more entries; for a two-dimensional function it has four values.
\[H(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix}\]There are always more values to compute in the Hessian than in the gradient, and hence a higher computational cost. Herein lies the challenge with Newton's method: the exact Hessian is hard to come by when $n$ is large, and storing the full matrix ($n^2$ numbers) may be impractical.
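For instance, for $f(x_1, x_2) = x_1^2 + 3x_1 x_2 + x_2^2$, the gradient has two entries while the Hessian has four:
\[\nabla f = \begin{bmatrix} 2x_1 + 3x_2 \\ 3x_1 + 2x_2 \end{bmatrix}, \qquad H(f) = \begin{bmatrix} 2 & 3 \\ 3 & 2 \end{bmatrix}\]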
The BFGS method is based on the idea of replacing the full Hessian matrix $H$ with an approximate matrix $B$ that is built up through an iterative updating formula with low-rank matrices as its increments.
In fact, since what is needed for the Newton step is $(H_k)^{-1}$, one usually stores a low-rank approximation of the inverse Hessian. To obtain this, we want to iteratively construct our approximation $H_{k+1}$ (or $(H_{k+1})^{-1}$) given only the gradient (first derivative) of $f$.
One important requirement is for the Hessian approximation to remain positive definite. In the BFGS derivation this is handled by measuring the update with a weighting matrix, which preserves positive definiteness. The BFGS update compares the inverse Hessian approximation of the current and previous iterations and applies the update rule shown below.
The update is obtained by solving:
\[\min_{H \in \mathbb{R}^{n \times n}} \|H^{-1} - (H^k)^{-1}\|_w\]This minimization has a closed-form solution; I won't go into the derivation, as this article is already becoming lengthy. With $\rho_k = \frac{1}{y_k^T s_k}$, the resulting update of the inverse Hessian approximation is:
\[H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T\]
F# Implementation.
We will build the implementation in the same order, starting with the line search:
// Wolfe condition line search
let rec wolfeLineSearch (fg: FuncWithGradient) (x: float[]) (p: float[]) (g: float[]) (alpha: float) (c1: float) (c2: float) =
    let rec helper alpha =
        let newX = Array.map2 (+) x (Array.map ((*) alpha) p)
        let newGrad = fg.grad newX
        // Backtrack while either Wolfe condition is violated.
        if fg.f newX > fg.f x + c1 * alpha * dot g p || dot newGrad p < c2 * dot g p then
            helper (alpha / 2.0)
        else
            alpha
    helper alpha
We halve $\alpha$ until the Wolfe conditions are met, a simple and widely used backtracking approach for choosing a step along the search direction.
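The snippet above relies on a FuncWithGradient record and a dot helper that are defined elsewhere in the module. A minimal sketch of what they might look like, inferred from how they are used here (the version on GitHub may differ):
// Objective function bundled with its gradient.
type FuncWithGradient =
    { f: float[] -> float          // objective value at a point
      grad: float[] -> float[] }   // gradient at a point

// Dot product of two vectors.
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc ai bi -> acc + ai * bi) 0.0 a b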
let bfgs (fg: FuncWithGradient) (x0: float[]) (tol: float) (maxIter: int) =
    let n = x0.Length
    let mutable iterationData = []
    let rec loop x g H iter =
        let gradNorm = Array.sumBy (fun gi -> gi ** 2.0) g |> sqrt
        // Convergence check: stop when the gradient norm is small enough
        // or the iteration budget is exhausted.
        if gradNorm <= tol || iter >= maxIter then
            let finalData = {
                Iteration = iter
                GradientNorm = gradNorm
                Tolerance = tol
                Alpha = 0.0
                StepSize = Array.zeroCreate n
                Position = x
                FunctionValue = fg.f x
            }
            // Append the last convergence iteration info.
            iterationData <- finalData :: iterationData
            (x, List.rev iterationData)
        else
            let p = Array.map (fun gi -> -gi) (matVecMult H g) // Search direction p = -H * grad
            let alpha = wolfeLineSearch fg x p g 1.0 1e-4 0.9
            let s = Array.map ((*) alpha) p
            let xNew = Array.map2 (+) x s
            let gNew = fg.grad xNew
            // Collect nth iteration info.
            let iterInfo = {
                Iteration = iter
                GradientNorm = gradNorm
                Tolerance = tol
                Alpha = alpha
                StepSize = s
                Position = xNew
                FunctionValue = fg.f xNew
            }
            iterationData <- iterInfo :: iterationData
            // BFGS inverse Hessian update:
            // H_{k+1} = (I - rho*s*y^T) H_k (I - rho*y*s^T) + rho*s*s^T
            let y = Array.map2 (-) gNew g
            let rho = 1.0 / dot y s
            let I = identityMatrix n
            let sOuterY = outerProduct s y
            let yOuterS = outerProduct y s
            let sOuterS = outerProduct s s
            let term1 = Array.init n (fun i -> Array.init n (fun j -> I.[i].[j] - rho * sOuterY.[i].[j]))
            let term2 = Array.init n (fun i -> Array.init n (fun j -> I.[i].[j] - rho * yOuterS.[i].[j]))
            let HNew = matMatMult (matMatMult term1 H) term2 |> Array.mapi (fun i row -> Array.mapi (fun j v -> v + rho * sOuterS.[i].[j]) row)
            loop xNew gNew HNew (iter + 1)
    let H0 = identityMatrix n
    loop x0 (fg.grad x0) H0 0
As we can see, the bulk of the work is in updating the new Hessian approximation. The beauty of F# is at play here: recursion makes the code cleaner than explicit loops, doesn't it?
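Besides wolfeLineSearch, the bfgs function uses a record for the per-iteration diagnostics and a few small matrix helpers that are not shown above. A plausible sketch, representing matrices as arrays of row arrays to match the H.[i].[j] indexing (the record is named IterationInfo here only for illustration; the full module on GitHub may differ):
// Per-iteration diagnostics collected by bfgs.
type IterationInfo =
    { Iteration: int
      GradientNorm: float
      Tolerance: float
      Alpha: float
      StepSize: float[]
      Position: float[]
      FunctionValue: float }

// n x n identity matrix as an array of row arrays.
let identityMatrix n =
    Array.init n (fun i -> Array.init n (fun j -> if i = j then 1.0 else 0.0))

// Matrix-vector product.
let matVecMult (m: float[][]) (v: float[]) =
    m |> Array.map (fun row -> Array.fold2 (fun acc mij vj -> acc + mij * vj) 0.0 row v)

// Matrix-matrix product.
let matMatMult (a: float[][]) (b: float[][]) =
    Array.init a.Length (fun i ->
        Array.init b.[0].Length (fun j ->
            Array.fold2 (fun acc aik bRow -> acc + aik * bRow.[j]) 0.0 a.[i] b))

// Outer product u * v^T.
let outerProduct (u: float[]) (v: float[]) =
    Array.init u.Length (fun i -> Array.init v.Length (fun j -> u.[i] * v.[j]))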
Let’s test this and see how accurate it is.
let rosenbrock (x: float[]) =
    let a = 1.0
    let b = 100.0
    let x1, x2 = x.[0], x.[1]
    pown (a - x1) 2 + b * pown (x2 - pown x1 2) 2

let rosenbrockGradient (x: float[]) =
    let a = 1.0
    let b = 100.0
    let x1, x2 = x.[0], x.[1]
    let dx1 = -2.0 * (a - x1) - 4.0 * b * x1 * (x2 - pown x1 2)
    let dx2 = 2.0 * b * (x2 - pown x1 2)
    [| dx1; dx2 |]

let fg: BFGSMethod.FuncWithGradient = { f = rosenbrock; grad = rosenbrockGradient }
// Our initial guess.
let x0 = [| -1.2; 1.0 |]
let tol = 1e-6
let maxIter = 1000

let result = BFGSMethod.bfgs fg x0 tol maxIter
let (optimizedPosition, iterationHistory) = result
This should output:
Optimization result: [|0.9999999952; 0.9999999902|]
//The convergence details.
Convergence Details: { Iteration = 34
GradientNorm = 8.83462856e-08
Tolerance = 1e-06
Alpha = 0.0
StepSize = [|0.0; 0.0|]
Position = [|0.9999999952; 0.9999999902|]
FunctionValue = 2.745635491e-17 }
I'll proceed to compare this with the output of MathNet.
NOTE: I am only posting key code snippets here so as not to fill the write-up with too much code, but check out my GitHub for the full code; it will be linked at the end of the write-up.
let BFGS f grad initialGuess =
    let objectiveFunction =
        new System.Func<Vector<float>, float>(f)
    let gradientFunction =
        new System.Func<Vector<float>, Vector<float>>(grad)
    let obj = ObjectiveFunction.Gradient(objectiveFunction, gradientFunction)
    let solver = BfgsMinimizer(1e-5, 1e-5, 1e-5, 1000)
    let result = solver.FindMinimum(obj, initialGuess)
    {
        MinimizingPoint = result.MinimizingPoint |> Vector.toSeq
        Iterations = result.Iterations
        FunctionValue = result.FunctionInfoAtMinimum.Value
        ReasonForExit = result.ReasonForExit
    }
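The rosenbrockFunction and gradient passed to this wrapper below are Vector<float> versions of the array-based functions defined earlier; one way they could be written (assuming MathNet.Numerics.LinearAlgebra is open and the earlier definitions are in scope):
// Wrap the array-based functions so their signatures match what MathNet expects.
let rosenbrockFunction (v: Vector<float>) = rosenbrock (v.ToArray())
let gradient (v: Vector<float>) = Vector<float>.Build.Dense(rosenbrockGradient (v.ToArray()))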
let x0 = [| -1.2; 1.0 |] |> vector
let result = BFGS rosenbrockFunction gradient x0
Convergence Details:
{ MinimizingPoint = [|0.9999999999; 0.9999999997|]
Iterations = 37
FunctionValue = 6.853123565e-19
ReasonForExit = AbsoluteGradient }
From the results, we can see that we are headed in the right direction.
Application to the Kenyan Yield Curve Interpolation Using the Nelson-Siegel-Svensson Model.
We started with a general introduction to the BFGS optimization method and worked through the specific components that make up the algorithm. We then implemented it in F# and ran tests on its performance, comparing our results against those from MathNet to validate the implementation. The outcomes were close.
Mathematical concepts, no matter how elegant, are most powerful when applied to solve actual problems and provide actual solutions. In this next section we will use the BFGS algorithm to interpolate bond yields, allowing us to assess the Kenyan bond market and anticipate potential trends. This becomes useful information for fund managers, as it can inform investment direction.
We will fetch the yield curve data from the World Government Bonds site. For this write-up we won't dwell on the Svensson model itself; perhaps in a later article.
To implement our interpolation, we will use the BFGS module built above within the fitNSSModel function. The objective function here minimizes the sum of squared errors between the fitted and observed yields. To further test our model's performance, we'll compare the results against MathNet and experiment with automatic differentiation using DiffSharp to determine whether it improves accuracy.
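The svenssonmodel and nssObjectiveAndGradient functions used below are not reproduced in full here. As a rough sketch of what they could look like, the following version computes the Svensson yield and a sum-of-squared-errors objective, with the gradient taken by central finite differences (the actual implementation on GitHub may use analytic derivatives and differ in details):
// Svensson (NSS) yield at maturity t, with beta = [| b0; b1; b2; b3; tau1; tau2 |].
let svenssonmodel (beta: float[]) (t: float) =
    let b0, b1, b2, b3, tau1, tau2 = beta.[0], beta.[1], beta.[2], beta.[3], beta.[4], beta.[5]
    let term1 = (1.0 - exp (-t / tau1)) / (t / tau1)
    let term2 = term1 - exp (-t / tau1)
    let term3 = (1.0 - exp (-t / tau2)) / (t / tau2) - exp (-t / tau2)
    b0 + b1 * term1 + b2 * term2 + b3 * term3

// Sum of squared errors between the model and the observed yields, plus a
// central finite-difference gradient with respect to beta.
let nssObjectiveAndGradient (beta: float[]) (maturities: float[]) (yields: float[]) =
    let sse (b: float[]) =
        Array.map2 (fun t y -> let e = svenssonmodel b t - y in e * e) maturities yields
        |> Array.sum
    let h = 1e-7
    let grad =
        beta
        |> Array.mapi (fun i _ ->
            let bPlus = Array.copy beta
            let bMinus = Array.copy beta
            bPlus.[i] <- beta.[i] + h
            bMinus.[i] <- beta.[i] - h
            (sse bPlus - sse bMinus) / (2.0 * h))
    sse beta, grad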
// Curve fitting using the BFGS module
let fitNSSModel maturities yields maxIter =
    // beta = [| b0; b1; b2; b3; tau1; tau2 |]
    let initialGuess = [| 0.01; 0.01; 0.01; 0.01; 1.0; 1.0 |]
    let tol = 1e-6
    let fg = {
        f = (fun beta -> fst (nssObjectiveAndGradient beta maturities yields))
        grad = (fun beta -> snd (nssObjectiveAndGradient beta maturities yields))
    }
    // Run BFGS optimization
    let result, iterationData = bfgs fg initialGuess tol maxIter
    result, iterationData

let time = [| 0.25; 0.5; 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0; 9.0; 10.0; 15.0; 20.0; 25.0 |]
let ytms =
    [| 15.988; 16.850; 16.921; 16.684; 15.817; 16.778; 16.976; 17.035; 17.848; 17.616; 17.532; 17.476; 17.400; 17.260; 17.120 |]
    |> Array.map (fun x -> x / 100.0)

// Fit NSS model to the data
let result, iterationData = fitNSSModel time ytms 500

// Visualization using Plotly.NET
let fittingFunctionNSS = svenssonmodel

let NSSChart =
    let rawChartNSS =
        Chart.Point(time, ytms)
        |> Chart.withTraceInfo "Actual data"
    let fittingNSS =
        let fit =
            [| 0.25 .. 0.5 .. 25.0 |]
            |> Array.map (fun x -> x, fittingFunctionNSS result x)
        Chart.Line(fit)
        |> Chart.withTraceInfo "NSS model-Custom BFGS"
    let fittingNSSMathnet =
        let fit =
            [| 0.25 .. 0.5 .. 25.0 |]
            |> Array.map (fun x -> x, fittingFunctionNSS (mathNetResult.MinimizingPoint |> Seq.toArray) x)
        Chart.Line(fit)
        |> Chart.withTraceInfo "NSS model-MathNet BFGS"
    let fittingNSSAutoDiff =
        let fit =
            [| 0.25 .. 0.5 .. 25.0 |]
            |> Array.map (fun x -> x, fittingFunctionNSS AutoDiffRes x)
        Chart.Line(fit)
        |> Chart.withTraceInfo "NSS model-AutoDiff BFGS"
    [ rawChartNSS; fittingNSS; fittingNSSMathnet; fittingNSSAutoDiff ]
    |> Chart.combine
    |> Chart.withTitle "Nelson–Siegel–Svensson model"
    |> Chart.withXAxisStyle (TitleText = "Years To Maturity")
    |> Chart.withYAxisStyle (TitleText = "Yield To Maturities", MinMax = (0.15, 0.20))

NSSChart |> Chart.show
From our results, we observe that the fitted curve approximates the yield curve reasonably well. It is also worth mentioning that automatic differentiation did not improve the fit.
This method has limitations. For example, the BFGS method does not enforce constraints on parameter bounds, which can be critical here. As such, while BFGS provides a near-optimal fit, it may not be ideal for this non-linear problem.
Conclusion
In this write up, we achieved our objective of understanding and implementing the BFGS optimization method using F#, applying it to a real-world curve-fitting problem. However, it’s worth noting that the NSS model is a non-linear least-squares problem and may be better fitted using the Levenberg–Marquardt algorithm.
There are further improvements to be made to the BFGS algorithm we built: in some instances the Wolfe conditions might not be met, depending on how smooth the function is. Also, matrices are not first-class citizens in F#; using a dedicated linear algebra library for the matrix multiplications and Cholesky decompositions would go a long way toward improving performance.
References
Dai, Y. H. (2002). Convergence properties of the BFGS algorithm. SIAM Journal on Optimization, 13(3), 693-701.
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.