Understanding BFGS Optimization with F# Implementation

Application of the BFGS method to the Nelson-Siegel-Svensson Model for Kenyan Yield Curve Fitting
1.) Introduction.
Optimization methods have always been one of my favorite topics given their central role in solving complex problems, including those involving probability. Many problems, especially those involving parameter estimation, rely heavily on optimization techniques. These methods help us estimate parameters that are essential for developing models and, in turn, interpreting real-world data. For example, in interpolation models such as the Nelson-Siegel-Svensson (NSS) framework, optimization algorithms are used to estimate the parameters that define the curve. Once these parameters are determined, they can be used to generate smooth yield curves for bond markets.
So, what is an optimization algorithm, and why is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm a particularly powerful method? In this article, we’ll explore the BFGS algorithm, its relevance in solving nonlinear optimization problems, and how it can be implemented using F#.
In general, an optimization algorithm iteratively refines the solution to a problem by improving the design variables of a function that needs to be minimized or maximized, until a stopping criterion is met. This criterion could be either reaching the best possible solution or exhausting the budgeted time or computational resources.
To better illustrate, let's define a simple unconstrained optimization problem:

\min_{x \in \mathbb{R}^n} f(x)

where f : \mathbb{R}^n \to \mathbb{R} is the objective function to be minimized and x is the vector of design variables.
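As a toy illustration of this iterative refinement (a hypothetical example, unrelated to the yield curve code later in the article), the minimal sketch below minimizes f(x) = (x - 3)^2 with plain fixed-step gradient descent, stopping once the gradient is small enough or the iteration budget runs out:

// Toy example: minimize f(x) = (x - 3)^2 with fixed-step gradient descent.
let f x = (x - 3.0) ** 2.0
let f' x = 2.0 * (x - 3.0)          // derivative of f

let rec descend x step tol maxIter iter =
    // Stop when the gradient is small enough or the iteration budget runs out.
    if abs (f' x) <= tol || iter >= maxIter then x
    else descend (x - step * f' x) step tol maxIter (iter + 1)

let xMin = descend 0.0 0.1 1e-8 1000 0   // converges to ~3.0

The BFGS method we build below follows the same loop structure; what changes is how the step direction and step length are chosen at each iteration.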
2.) Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method.
The BFGS method is a quasi-Newton method for solving unconstrained nonlinear optimization problems (Dai, 2002). Although it is generally effective for convex functions, Dai (2002) demonstrated that BFGS with Wolfe line searches may fail to converge for nonconvex functions.
The Wolfe Conditions Line Search
Line search determines the appropriate step length used to obtain the next design point along the search direction. While selecting a step size is crucial, spending excessive time on this process can hurt the convergence rate and overall efficiency of the optimization algorithm, which underscores the importance of an effective line search strategy. Ideally, one would aim for a global minimizer along the direction; however, identifying this value exactly is generally too costly computationally.
The Wolfe conditions play a crucial role in optimization algorithms, ensuring convergence to stationary points and maintaining the positive definiteness of the reduced Hessian approximations (Gilbert, 1997). These conditions enforce sufficient decrease and curvature requirements on the objective function.
Mathematically, the two conditions (sufficient decrease and curvature) are represented respectively as:

f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k

\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k

with 0 < c_1 < c_2 < 1, where \alpha_k is the step size, p_k is the search direction, and c_1, c_2 are the sufficient-decrease and curvature constants (typical choices are c_1 = 10^{-4} and c_2 = 0.9, the values used in the code below).
There is more to the Wolfe conditions than what is captured here; a text that covers them in detail is Nocedal and Wright's Numerical Optimization.
Consider an optimization problem involving a line search for a given function, with an initial point x_0 and search direction p_0. The objective is to find the optimal step size that minimizes the function along this search direction, subject to the Wolfe conditions. The line search problem can be expressed as a one-dimensional minimization over the step size:

\min_{\alpha > 0} \phi(\alpha) = f(x_0 + \alpha p_0)

Simplifying this expression and solving it, we obtain the optimal step size, which updates the new point to x_1 = x_0 + \alpha p_0. You can verify that the \alpha we get here meets the Wolfe conditions.
It is important to recall the sufficient decrease condition, as we will work on verifying that condition below.
We begin by calculating the gradient (partial derivatives) and evaluating it at the initial point, which gives the directional derivative \nabla f(x_0)^T p_0 along the search direction.
Next, we compute the objective function values: the value f(x_0) at the initial point, and the value f(x_0 + \alpha p_0) at the new point after taking the step.
Finally, we check the sufficient decrease condition. Assuming a typical value for c_1 (for example 10^{-4}), the condition

f(x_0 + \alpha p_0) \le f(x_0) + c_1 \alpha \nabla f(x_0)^T p_0

is satisfied, since the left-hand side does not exceed the right-hand side. The sufficient-decrease Wolfe condition therefore holds for the chosen \alpha.
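The same check is easy to do numerically. Here is a small hedged sketch (using an assumed quadratic f(x) = x1^2 + x2^2, an initial point of [1; 1], the steepest-descent direction, and alpha = 0.5, none of which come from the original worked example) that verifies the sufficient decrease condition in F#:

// Check the sufficient decrease (Armijo) condition for an assumed example.
let fQuad (x: float[]) = x.[0] ** 2.0 + x.[1] ** 2.0
let gradQuad (x: float[]) = [| 2.0 * x.[0]; 2.0 * x.[1] |]
let dotProd (a: float[]) (b: float[]) = Array.fold2 (fun acc ai bi -> acc + ai * bi) 0.0 a b

let xStart = [| 1.0; 1.0 |]
let p = gradQuad xStart |> Array.map (fun gi -> -gi)   // steepest-descent direction
let c1 = 1e-4                                          // typical value for c1
let alpha = 0.5

let xNew = Array.map2 (fun xi pi -> xi + alpha * pi) xStart p
let sufficientDecrease =
    fQuad xNew <= fQuad xStart + c1 * alpha * dotProd (gradQuad xStart) p
// Evaluates to true: f drops from 2.0 to 0.0, while the right-hand side
// is 2.0 + 1e-4 * 0.5 * (-8.0) = 1.9996.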
BFGS Algorithm
Given a starting point x_0 and a convergence tolerance, choose an initial guess H_0 for the inverse Hessian approximation (the identity matrix is a common choice). Then, at each iteration, compute the search direction p_k = -H_k \nabla f(x_k), find a step size \alpha_k satisfying the Wolfe conditions, set x_{k+1} = x_k + \alpha_k p_k, and update the approximation to H_{k+1} using the rule below. Repeat until the gradient norm drops below the tolerance or the iteration budget is exhausted.
Hessian Matrix Update Rule
A Hessian matrix is used in quasi-Newton optimization to capture the curvature of the function, that is, how fast the gradient varies. It contains all the second derivatives, and this is what ensures the approximation has the same curvature as the original function.
In any optimization problem, the first thing we need is the derivative or partial derivatives, as this dictates the direction the optimization is going to take.
Example:
Assuming we have a two-dimensional function f(x_1, x_2), its gradient \nabla f = [\partial f / \partial x_1, \; \partial f / \partial x_2]^T has two values, but the Hessian matrix has a few more: for two dimensions it has four values, since it is the 2 × 2 matrix of second derivatives.
So there are always more values to compute in the Hessian than in the gradient, hence a higher computational cost. There enters the challenge with Newton's method: the exact Hessian is hard to come by when n is large, and storing the n × n Hessian matrix becomes expensive.
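To make the size difference concrete, take a hypothetical two-variable function (chosen purely for illustration, not used elsewhere in this article):

f(x_1, x_2) = x_1^2 + 3 x_1 x_2 + x_2^3

\nabla f(x_1, x_2) = \begin{bmatrix} 2x_1 + 3x_2 \\ 3x_1 + 3x_2^2 \end{bmatrix} \quad \text{(two entries)}

H(x_1, x_2) = \begin{bmatrix} 2 & 3 \\ 3 & 6x_2 \end{bmatrix} \quad \text{(four entries)}

In general, an n-variable function has n first derivatives but n \times n second derivatives, which is where the storage and computation cost of the exact Hessian comes from.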
The BFGS method is based on the basic idea of replacing the full Hessian matrix with an approximation built up from gradient information gathered at each iteration. In fact, since what is needed for the Newton step is the inverse Hessian applied to the gradient, BFGS maintains an approximation H_k of the inverse Hessian directly, so the search direction is simply p_k = -H_k \nabla f(x_k) and no linear system has to be solved.
One important thing is for the approximation to remain positive definite, which guarantees that p_k is a descent direction. The BFGS update achieves this, provided the Wolfe curvature condition holds, by comparing the current and previous iterates and gradients and applying the update rule below to the approximate inverse Hessian.
The update is given by:

H_{k+1} = (I - \rho_k s_k y_k^T) \, H_k \, (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k},

where s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k). This update is the closed-form solution of a minimization problem (finding the matrix closest to H_k, in a weighted norm, that satisfies the secant condition H_{k+1} y_k = s_k); I won't go into the derivation, as this article is already becoming lengthy.
F# Implementation.
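Before the solver itself, note that the snippets below rely on a FuncWithGradient record, a record for per-iteration diagnostics, and a few small linear algebra helpers (dot, matVecMult, matMatMult, outerProduct, identityMatrix) that are not shown in this post. A minimal sketch of how they might be defined is given here; the diagnostics record name (IterationInfo) and its fields are inferred from the solver code, and the exact definitions live in the linked repository.

// A function bundled with its gradient.
type FuncWithGradient = {
    f: float[] -> float
    grad: float[] -> float[]
}

// Per-iteration diagnostics collected by the solver
// (field names follow their usage in the solver code below).
type IterationInfo = {
    Iteration: int
    GradientNorm: float
    Tolerance: float
    Alpha: float
    StepSize: float[]
    Position: float[]
    FunctionValue: float
}

// Small linear algebra helpers on jagged arrays.
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc ai bi -> acc + ai * bi) 0.0 a b

let matVecMult (m: float[][]) (v: float[]) =
    m |> Array.map (fun row -> dot row v)

let matMatMult (a: float[][]) (b: float[][]) =
    Array.init a.Length (fun i ->
        Array.init b.[0].Length (fun j ->
            Array.init b.Length (fun k -> a.[i].[k] * b.[k].[j]) |> Array.sum))

let outerProduct (a: float[]) (b: float[]) =
    Array.init a.Length (fun i -> Array.init b.Length (fun j -> a.[i] * b.[j]))

let identityMatrix n =
    Array.init n (fun i -> Array.init n (fun j -> if i = j then 1.0 else 0.0))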
We will build the implementation in the same order as the theory above, starting with the line search:
// Wolfe condition line search
let rec wolfeLineSearch (fg: FuncWithGradient) (x: float[]) (p: float[]) (g: float[]) (alpha: float) (c1: float) (c2: float) =
    let rec helper alpha =
        let newX = Array.map2 (+) x (Array.map ((*) alpha) p)
        let newGrad = fg.grad newX
        if fg.f newX > fg.f x + c1 * alpha * dot g p || dot newGrad p < c2 * dot g p then
            helper (alpha / 2.0)
        else
            alpha
    helper alpha
We halve the step size α whenever either of the Wolfe conditions is violated, and return it once both are satisfied.
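As a quick sanity check of the line search on its own, here is a small hedged example; the function, starting point, and direction are chosen purely for illustration and reuse the assumed helpers sketched above:

// Try the line search on f(x) = (x1 - 1)^2 + 2 * (x2 + 1)^2 from the origin,
// along the steepest-descent direction.
let fgTest: FuncWithGradient = {
    f = fun (x: float[]) -> (x.[0] - 1.0) ** 2.0 + 2.0 * (x.[1] + 1.0) ** 2.0
    grad = fun (x: float[]) -> [| 2.0 * (x.[0] - 1.0); 4.0 * (x.[1] + 1.0) |]
}
let xStart = [| 0.0; 0.0 |]
let gStart = fgTest.grad xStart
let pDir = gStart |> Array.map (fun gi -> -gi)

// Start from alpha = 1.0 with the usual c1 = 1e-4 and c2 = 0.9;
// for this example the search halves alpha once and returns 0.5.
let alphaStar = wolfeLineSearch fgTest xStart pDir gStart 1.0 1e-4 0.9

With the line search in place, here is the main BFGS routine: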
let bfgs (fg: FuncWithGradient) (x0: float[]) (tol: float) (maxIter: int) =
    let n = x0.Length
    let mutable iterationData = []
    let rec loop x g H iter =
        let gradNorm = Array.sumBy (fun gi -> gi ** 2.0) g |> sqrt
        // Convergence check.
        if gradNorm <= tol || iter >= maxIter then
            let finalData = {
                Iteration = iter
                GradientNorm = gradNorm
                Tolerance = tol
                Alpha = 0.0
                StepSize = Array.zeroCreate n
                Position = x
                FunctionValue = fg.f x
            }
            // Append the last convergence iteration info.
            iterationData <- finalData :: iterationData
            (x, List.rev iterationData)
        else
            let p = Array.map (fun gi -> -gi) (matVecMult H g) // This is our search direction.
            let alpha = wolfeLineSearch fg x p g 1.0 1e-4 0.9
            let s = Array.map ((*) alpha) p
            let xNew = Array.map2 (+) x s
            let gNew = fg.grad xNew
            // Collect nth iteration info.
            let iterInfo = {
                Iteration = iter
                GradientNorm = gradNorm
                Tolerance = tol
                Alpha = alpha
                StepSize = s
                Position = xNew
                FunctionValue = fg.f xNew
            }
            iterationData <- iterInfo :: iterationData
            let y = Array.map2 (-) gNew g
            let rho = 1.0 / dot y s
            let I = identityMatrix n
            let sOuterY = outerProduct s y
            let yOuterS = outerProduct y s
            let sOuterS = outerProduct s s
            // Inverse Hessian update: H = (I - rho*s*y^T) * H * (I - rho*y*s^T) + rho*s*s^T
            let term1 = Array.init n (fun i -> Array.init n (fun j -> I.[i].[j] - rho * sOuterY.[i].[j]))
            let term2 = Array.init n (fun i -> Array.init n (fun j -> I.[i].[j] - rho * yOuterS.[i].[j]))
            let HNew =
                matMatMult (matMatMult term1 H) term2
                |> Array.mapi (fun i row -> Array.mapi (fun j v -> v + rho * sOuterS.[i].[j]) row)
            loop xNew gNew HNew (iter + 1)
    let H0 = identityMatrix n
    loop x0 (fg.grad x0) H0 0
As we can see, the bulk of the work is in updating the new Hessian approximation. The beauty of F# is at play here: recursion makes the code cleaner than explicit for loops, doesn't it?
Let's test this and see how accurate it is.
let rosenbrock (x: float[]) =
    let a = 1.0
    let b = 100.0
    let x1, x2 = x.[0], x.[1]
    pown (a - x1) 2 + b * pown (x2 - pown x1 2) 2

let rosenbrockGradient (x: float[]) =
    let a = 1.0
    let b = 100.0
    let x1, x2 = x.[0], x.[1]
    let dx1 = -2.0 * (a - x1) - 4.0 * b * x1 * (x2 - pown x1 2)
    let dx2 = 2.0 * b * (x2 - pown x1 2)
    [| dx1; dx2 |]

let fg: BFGSMethod.FuncWithGradient = { f = rosenbrock; grad = rosenbrockGradient }

// Our initial guess.
let x0 = [| -1.2; 1.0 |]
let tol = 1e-6
let maxIter = 1000

let result = BFGSMethod.bfgs fg x0 tol maxIter
let (optimizedPosition, iterationHistory) = result
This should output:
Optimization result: [|0.9999999952; 0.9999999902|]
//The convergence details.
Convergence Details: { Iteration = 34
GradientNorm = 8.83462856e-08
Tolerance = 1e-06
Alpha = 0.0
StepSize = [|0.0; 0.0|]
Position = [|0.9999999952; 0.9999999902|]
FunctionValue = 2.745635491e-17 }
I'll proceed to compare this with the output of MathNet.Numerics.
NOTE: I am only posting key code snippets here so as not to fill the write-up with a lot of code, but check out my GitHub; the full code is there, and it will be linked at the end of the write-up.
open MathNet.Numerics.LinearAlgebra
open MathNet.Numerics.Optimization

let BFGS f grad initialGuess =
    let objectiveFunction =
        new System.Func<Vector<float>, float>(f)
    let gradientFunction =
        new System.Func<Vector<float>, Vector<float>>(grad)
    let obj = ObjectiveFunction.Gradient(objectiveFunction, gradientFunction)
    let solver = BfgsMinimizer(1e-5, 1e-5, 1e-5, 1000)
    let result = solver.FindMinimum(obj, initialGuess)
    {
        MinimizingPoint = result.MinimizingPoint |> Vector.toSeq
        Iterations = result.Iterations
        FunctionValue = result.FunctionInfoAtMinimum.Value
        ReasonForExit = result.ReasonForExit
    }

let x0 = [| -1.2; 1.0 |] |> vector
let result = BFGS rosenbrockFunction gradient x0
Convergence Details:
{ MinimizingPoint = [|0.9999999999; 0.9999999997|]
Iterations = 37
FunctionValue = 6.853123565e-19
ReasonForExit = AbsoluteGradient }
From the results, we can see that we are heading in the right direction: both implementations converge to essentially the same minimizer, with comparable iteration counts.
Application to Kenyan Yield Curve Interpolation Using the Nelson-Siegel-Svensson Model.
We started by looking at the BFGS optimization method, from a general introduction through the specific components that build up the algorithm. We then implemented it in F# and ran tests on its performance, comparing our results against those from MathNet to validate the implementation; the outcomes were close.
Given all this, mathematical concepts, no matter how elegant, are most powerful when applied to solve actual problems and provide actual solutions. In this next section we will use the BFGS algorithm to interpolate bond yields, allowing us to assess the Kenyan bond market and anticipate potential trends. This becomes useful information to fund managers, as it can inform investment direction.
We will fetch the yield curve data from the World Government Bonds site. For this write-up we won't delve into the Svensson model itself; perhaps in a later article.
To implement our interpolation, we will use the BFGS module built above within the fitNSSModel
function. The objective function here minimizes the sum of squared errors between the model yields and the observed yields. To further test our model's performance, we'll compare results using MathNet and experiment with automatic differentiation using DiffSharp to determine if it improves accuracy.
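The fitting code below calls svenssonmodel and nssObjectiveAndGradient, which are not reproduced in this post. For context, here is a minimal sketch of what they might look like, assuming the standard Svensson parameterization β = [β0; β1; β2; β3; τ1; τ2], a sum-of-squared-errors objective, and (for brevity) a central finite-difference gradient in place of the analytic gradient used in the repository:

// Svensson (NSS) yield for parameters beta = [| b0; b1; b2; b3; tau1; tau2 |] at maturity t.
let svenssonmodel (beta: float[]) (t: float) =
    let b0, b1, b2, b3, tau1, tau2 = beta.[0], beta.[1], beta.[2], beta.[3], beta.[4], beta.[5]
    let term1 = (1.0 - exp (-t / tau1)) / (t / tau1)
    let term2 = term1 - exp (-t / tau1)
    let term3 = (1.0 - exp (-t / tau2)) / (t / tau2) - exp (-t / tau2)
    b0 + b1 * term1 + b2 * term2 + b3 * term3

// Sum of squared errors between model and observed yields, with a
// central finite-difference gradient (the analytic version is in the repository).
let nssObjectiveAndGradient (beta: float[]) (maturities: float[]) (yields: float[]) =
    let sse (b: float[]) =
        Array.map2 (fun t y -> let e = svenssonmodel b t - y in e * e) maturities yields
        |> Array.sum
    let h = 1e-7
    let grad =
        Array.init beta.Length (fun i ->
            let bPlus = Array.copy beta
            let bMinus = Array.copy beta
            bPlus.[i] <- beta.[i] + h
            bMinus.[i] <- beta.[i] - h
            (sse bPlus - sse bMinus) / (2.0 * h))
    sse beta, grad

The only property the fitting code relies on is that nssObjectiveAndGradient returns the objective value and its gradient as a pair, matching the FuncWithGradient record used by the solver.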
// Curve fitting using the BFGS module
let fitNSSModel maturities yields maxIter =
    let initialGuess = [| 0.01; 0.01; 0.01; 0.01; 1.0; 1.0 |]
    let tol = 1e-6
    let fg = {
        f = (fun beta -> fst (nssObjectiveAndGradient beta maturities yields))
        grad = (fun beta -> snd (nssObjectiveAndGradient beta maturities yields))
    }
    // Run BFGS optimization
    let result, iterationData = bfgs fg initialGuess tol maxIter
    result, iterationData

let time = [| 0.25; 0.5; 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0; 9.0; 10.0; 15.0; 20.0; 25.0 |]
let ytms =
    [| 15.988; 16.850; 16.921; 16.684; 15.817; 16.778; 16.976; 17.035; 17.848; 17.616; 17.532; 17.476; 17.400; 17.260; 17.120 |]
    |> Array.map (fun x -> x / 100.0)

// Fit NSS model to the data
let result, iterationData = fitNSSModel time ytms 500

// Visualization using Plotly.NET
let fittingFunctionNSS = svenssonmodel   // takes the fitted parameter vector and a maturity

let NSSChart =
    let rawChartNSS =
        Chart.Point(time, ytms)
        |> Chart.withTraceInfo "Actual data"
    let fittingNSS =
        let fit =
            [| 0.25 .. 0.5 .. 25.0 |]
            |> Array.map (fun x -> x, fittingFunctionNSS result x)
        Chart.Line(fit)
        |> Chart.withTraceInfo "NSS model-Custom BFGS"
    let fittingNSSMathnet =
        let fit =
            [| 0.25 .. 0.5 .. 25.0 |]
            |> Array.map (fun x -> x, fittingFunctionNSS (mathNetResult.MinimizingPoint |> Seq.toArray) x)
        Chart.Line(fit)
        |> Chart.withTraceInfo "NSS model-MathNet BFGS"
    let fittingNSSAutoDiff =
        let fit =
            [| 0.25 .. 0.5 .. 25.0 |]
            |> Array.map (fun x -> x, fittingFunctionNSS AutoDiffRes x)
        Chart.Line(fit)
        |> Chart.withTraceInfo "NSS model-AutoDiff BFGS"
    [ rawChartNSS; fittingNSS; fittingNSSMathnet; fittingNSSAutoDiff ]
    |> Chart.combine
    |> Chart.withTitle "Nelson–Siegel–Svensson model"
    |> Chart.withXAxisStyle (TitleText = "Years To Maturity")
    |> Chart.withYAxisStyle (TitleText = "Yield To Maturities", MinMax = (0.15, 0.20))

NSSChart |> Chart.show
From our results, we observe that the fitted curve approximates the yield curve reasonably well. It is also worth mentioning that automatic differentiation did not noticeably improve the fit.
This method has limitations. For example, the BFGS method doesn't enforce constraints on parameter bounds, which can be critical here (for instance, the decay parameters τ1 and τ2 should stay positive). As such, while BFGS provides a near-optimal fit, it may not be ideal for this non-linear problem.
Conclusion
In this write-up, we achieved our objective of understanding and implementing the BFGS optimization method using F#, applying it to a real-world curve-fitting problem. However, it's worth noting that the NSS model is a non-linear least-squares problem and may be better fitted using the Levenberg–Marquardt algorithm.
There are further improvements that need to be made to the BFGS algorithm we built: in some instances the Wolfe conditions might not be met, depending on how smooth the function is, and the line search should guard against that. Also, matrices are not first-class citizens in F#, so using a proper linear algebra library for the matrix multiplications and Cholesky decomposition would go a long way in improving performance.
References
Dai, Y. H. (2002). Convergence properties of the BFGS algorithm. SIAM Journal on Optimization, 13(3), 693-701.
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.