Numerical optimization 
  Lecture 12
 
Numerical optimization - line search 
Today we will be discussing one particular approach to numerical optimization - line search. This is a family of algorithmic approaches that attempt to find (global or local) minima by iterating on an initial guess. Generally they attempt to solve,
\[
\underset{\alpha>0}{\text{min}} f(x_k + \alpha \, p_k)
\] where \(f()\) is the function we are attempting to minimize, \(x_k\) is our current guess at iteration \(k\), \(\alpha\) is the step length, and \(p_k\) is the direction of movement.
We will only be dipping our toes in the water of this area but the goal is to provide some context for some of the more common (and easier) use cases.
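To get a feel for the \(\alpha\) subproblem, here is a minimal sketch using scipy's Wolfe-condition line search; the function, gradient, starting point, and direction below are illustrative choices, not anything from the lecture.

# Minimal sketch of the line search subproblem using scipy's Wolfe-condition
# search; f, grad, x_k, and p_k here are illustrative choices.
import numpy as np
from scipy.optimize import line_search

f = lambda x: np.sum(x**2)        # function to minimize
grad = lambda x: 2 * x            # its gradient

x_k = np.array([2.0, 1.0])        # current guess
p_k = -grad(x_k)                  # steepest descent direction

alpha, *_ = line_search(f, grad, x_k, p_k)
print(alpha, x_k + alpha * p_k)   # chosen step length and the proposed new point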
 
Naive Gradient Descent 
We will start with a naive approach to gradient descent where we choose a fixed step size and determine the direction based on the gradient of the function at each iteration.
import numpy as np

def grad_desc_1d(x0, f, grad, step, max_step=100, tol=1e-6):
  all_x_i = [x0]
  all_f_i = [f(x0)]
  x_i = x0

  try:
    for i in range(max_step):
      dx_i = grad(x_i)
      # step in the negative gradient direction, scaled by the fixed step size
      x_i = x_i - dx_i * step
      f_x_i = f(x_i)
      all_x_i.append(x_i)
      all_f_i.append(f_x_i)

      # stop once the gradient is (approximately) zero
      if np.abs(dx_i) < tol: break

  except OverflowError as err:
    print(f"{type(err).__name__}: {err}")

  if len(all_x_i) == max_step+1:
    print("Warning - Failed to converge!")

  return all_x_i, all_f_i
 
 
 
A basic example 
\[
\begin{aligned}
f(x) &= x^2 \\
\nabla f(x) &= 2x
\end{aligned}
\] 
 
f = lambda x: x**2
grad = lambda x: 2*x
 
 
 
opt = grad_desc_1d(-2., f, grad, step=0.25)
plot_1d_traj((-2, 2), f, opt)


opt = grad_desc_1d(-2., f, grad, step=0.5)
plot_1d_traj((-2, 2), f, opt)
 
 
 
 
 
Where can it go wrong? 
If you pick a bad step size then the iterates can oscillate or even diverge,
opt = grad_desc_1d(-2, f, grad, step=0.9)
plot_1d_traj((-2, 2), f, opt)


opt = grad_desc_1d(-2, f, grad, step=1)
Warning - Failed to converge!

plot_1d_traj((-2, 2), f, opt)
 
 
 
 
Local minima of a quartic 
Since the function is no longer convex, both the starting point and the step size matter.
\[
\begin{aligned}
f(x) &= x^4 + x^3 -x^2 - x \\
\nabla f(x) &= 4x^3 + 3x^2 - 2x - 1
\end{aligned}
\] 
 
f = lambda x: x**4 + x**3 - x**2 - x
grad = lambda x: 4*x**3 + 3*x**2 - 2*x - 1
 
 
 
opt = grad_desc_1d(-1.5, f, grad, step=0.2)
plot_1d_traj((-1.5, 1.5), f, opt)


opt = grad_desc_1d(-1.5, f, grad, step=0.25)
plot_1d_traj((-1.5, 1.5), f, opt)
 
 
 
 
 
Alternative starting points 
opt = grad_desc_1d(1.5, f, grad, step=0.2)
plot_1d_traj((-1.75, 1.5), f, opt)


opt = grad_desc_1d(1.25, f, grad, step=0.2)
plot_1d_traj((-1.75, 1.5), f, opt)
 
 
 
 
Problematic step sizes 
If the step size is too large it is possible for the algorithm to diverge, with the iterates growing until they overflow,
opt = grad_desc_1d(-1.5, f, grad, step=0.75)
OverflowError: (34, 'Result too large')

plot_1d_traj((-2, 2), f, opt)


opt = grad_desc_1d(1.5, f, grad, step=0.25)
OverflowError: (34, 'Result too large')

plot_1d_traj((-2, 2), f, opt)
 
 
 
 
Gradient Descent w/ backtracking 
As we have just seen, having too large a step size can be problematic; one solution is to allow the step size to adapt.
Backtracking involves checking whether the proposed move is advantageous (i.e. \(f(x_k+\alpha\, p_k) < f(x_k)\)):
If it is advantageous, then accept \(x_{k+1} = x_k+\alpha\, p_k\).

If not, shrink \(\alpha\) by a factor \(\tau\) (e.g. 0.5) and check again.

Pick a larger \(\alpha\) to start, since backtracking can only shrink the step and will not fix the inefficiency of a step size that is too small.
 
def grad_desc_1d_bt(
  x, f, grad, step, tau=0.5,
  max_step=100, max_back=10, tol=1e-6
):
  all_x_i = [x]
  all_f_i = [f(x)]

  try:
    for i in range(max_step):
      dx = grad(x)

      # backtracking: shrink the step by tau until the move decreases f
      # (note that step is never increased again once shrunk)
      for j in range(max_back):
        new_x = x + step * (-dx)
        new_f_x = f(new_x)
        if (new_f_x < all_f_i[-1]):
          break
        step = step * tau

      x = new_x
      f_x = new_f_x
      all_x_i.append(x)
      all_f_i.append(f_x)

      if np.abs(dx) < tol:
        break

  except OverflowError as err:
    print(f"{type(err).__name__}: {err}")

  if len(all_x_i) == max_step+1:
    print("Warning - Failed to converge!")

  return all_x_i, all_f_i
 
 
 
 
 
opt = grad_desc_1d_bt(-1.5, f, grad,
                      step=0.75, tau=0.5)
plot_1d_traj((-1.5, 1.5), f, opt)


opt = grad_desc_1d_bt(1.5, f, grad,
                      step=0.25, tau=0.5)
plot_1d_traj((-1.5, 1.5), f, opt)
 
 
 
 
A 2d cost function 
We will be using mk_quad() to create quadratic functions with varying conditioning (as specified by the epsilon parameter).
\[
\begin{align}
f(x,y) &= 0.33(x^2 + \epsilon^2 y^2 ) \\
\nabla f(x,y) &= \left[ \begin{matrix}
0.66 \, x \\
0.66 \, \epsilon^2 \, y
\end{matrix} \right] \\
\nabla^2 f(x,y) &= \left[\begin{array}{cc}
0.66  & 0 \\
0     & 0.66 \, \epsilon^2
\end{array}\right]
\end{align}
\] 
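Note that mk_quad() is a course helper whose definition is not shown here; a minimal sketch consistent with the formulas above (the actual helper may differ, e.g. in how it vectorizes its inputs) might look like:

# Hypothetical sketch of mk_quad() based on the formulas above; the actual
# course helper may be implemented differently.
import numpy as np

def mk_quad(epsilon):
  def f(x):
    x = np.asarray(x, dtype=float)
    return 0.33 * (x[0]**2 + epsilon**2 * x[1]**2)
  def grad(x):
    x = np.asarray(x, dtype=float)
    return np.array([0.66 * x[0], 0.66 * epsilon**2 * x[1]])
  def hess(x):
    return np.array([[0.66, 0.0],
                     [0.0,  0.66 * epsilon**2]])
  return f, grad, hess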
 
Examples 
f, grad, hess = mk_quad(0.7)
plot_2d_traj((-1, 2), (-1, 2), f,
             title="$\\epsilon=0.7$")


f, grad, hess = mk_quad(0.05)
plot_2d_traj((-1, 2), (-1, 2), f,
             title="$\\epsilon=0.05$")
 
 
 
 
2d gradient descent w/ backtracking 
def grad_desc_2d(x0, f, grad, step, tau=0.5, max_step=100, max_back=10, tol=1e-6):
  x_i = x0
  all_x_i = [x_i[0]]
  all_y_i = [x_i[1]]
  all_f_i = [f(x_i)]

  for i in range(max_step):
    dx_i = grad(x_i)

    # backtracking line search in the negative gradient direction
    for j in range(max_back):
      new_x_i = x_i - dx_i * step
      new_f_i = f(new_x_i)
      if (new_f_i < all_f_i[-1]): break
      step = step * tau

    x_i, f_i = new_x_i, new_f_i
    all_x_i.append(x_i[0])
    all_y_i.append(x_i[1])
    all_f_i.append(f_i)

    # stop once the gradient norm is (approximately) zero
    if np.sqrt(np.sum(dx_i**2)) < tol:
      break

  return all_x_i, all_y_i, all_f_i
 
 
 
Well conditioned cost function 
f, grad, hess = mk_quad(0.7)
opt = grad_desc_2d((1.6, 1.1), f, grad, step=1)
plot_2d_traj((-1, 2), (-1, 2), f,
             title="$\\epsilon=0.7$", traj=opt)


f, grad, hess = mk_quad(0.7)
opt = grad_desc_2d((1.6, 1.1), f, grad, step=2)
plot_2d_traj((-1, 2), (-1, 2), f,
             title="$\\epsilon=0.7$", traj=opt)
 
 
 
 
Ill-conditioned cost function 
f, grad, hess = mk_quad(0.05)
opt = grad_desc_2d((1.6, 1.1), f, grad, step=1)
plot_2d_traj((-1, 2), (-1, 2), f,
             title="$\\epsilon=0.05$", traj=opt)


f, grad, hess = mk_quad(0.05)
opt = grad_desc_2d((1.6, 1.1), f, grad, step=2)
plot_2d_traj((-1, 2), (-1, 2), f,
             title="$\\epsilon=0.05$", traj=opt)
 
 
 
 
Rosenbrock function (very ill conditioned) 
f, grad, hess = mk_rosenbrock()
opt = grad_desc_2d((1.6, 1.1), f, grad, step=0.25)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)


f, grad, hess = mk_rosenbrock()
opt = grad_desc_2d((-0.5, 0), f, grad, step=0.25)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)
 
 
 
 
Taylor Expansion 
For an arbitrary smooth function, we can construct a 2nd order Taylor approximation as follows,
\[
\begin{align}
f(x_k + \alpha \, p_k)
&= f(x_k) + \alpha \, p_k^T \nabla f(x_k + \alpha \, p_k) \\
&= f(x_k) + \alpha \, p_k^T \nabla f(x_k) + \frac{1}{2} \alpha^2 p_k^T \, \nabla^2 f(x_k + \alpha \, p_k) \, p_k \\
&\approx f(x_k) + \alpha \, p_k^T \nabla f(x_k) + \frac{1}{2} \alpha^2 p_k^T \, \nabla^2 f(x_k) \, p_k
\end{align}
\] 
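As a quick numerical sanity check (an illustrative example, not from the slides), we can compare the quadratic model against the quartic from earlier for a small step:

# Illustrative check of the 2nd order Taylor approximation using the earlier
# quartic; the particular point and step size are arbitrary choices.
f = lambda x: x**4 + x**3 - x**2 - x
g = lambda x: 4*x**3 + 3*x**2 - 2*x - 1   # f'
h = lambda x: 12*x**2 + 6*x - 2           # f''

x_k, delta = -1.5, 0.1
exact  = f(x_k + delta)
approx = f(x_k) + delta * g(x_k) + 0.5 * delta**2 * h(x_k)
print(exact, approx)   # the two values should agree closely for small delta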
 
Newton’s Method in 1d 
Let's simplify things for now by considering just the 1d case and writing \(\alpha\,p_k\) as \(\Delta\),
\[
f(x_k + \Delta) \approx f(x_k) + \Delta f'(x_k) + \frac{1}{2} \Delta^2 f''(x_k)
\] 
To find the \(\Delta\) that minimizes this approximation we can take the derivative with respect to \(\Delta\) and set it equal to zero, which gives,
\[
0 = f'(x_k) + \Delta f''(x_k) \;\; \Rightarrow \;\; \Delta = -\frac{f'(x_k)}{f''(x_k)}
\]  which then suggests an iterative update rule of
\[
x_{k+1} = x_{k} -\frac{f'(x_k)}{f''(x_k)}
\] 
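A minimal 1d implementation of this update rule (a sketch mirroring the interface of grad_desc_1d above, not code from the slides) might look like:

# Sketch of Newton's method in 1d; hess here is f'' and the update assumes
# it is nonzero along the path. Not code from the slides.
import numpy as np

def newton_1d(x0, f, grad, hess, max_step=100, tol=1e-6):
  all_x_i = [x0]
  all_f_i = [f(x0)]
  x_i = x0

  for i in range(max_step):
    d1, d2 = grad(x_i), hess(x_i)
    x_i = x_i - d1 / d2            # x_{k+1} = x_k - f'(x_k) / f''(x_k)
    all_x_i.append(x_i)
    all_f_i.append(f(x_i))

    if np.abs(d1) < tol:
      break

  return all_x_i, all_f_i

For example, newton_1d(-1.5, f, grad, lambda x: 12*x**2 + 6*x - 2) would apply this to the quartic from earlier (note that Newton's method converges to a stationary point, which need not be a minimum).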
 
Generalizing to \(n\) d 
Based on the same argument we obtain the following result for a function in \(\mathbb{R}^n\),
\[
f(x_k + \Delta) \approx f(x_k) + \Delta^T \nabla f(x_k) + \frac{1}{2} \Delta^T \, \nabla^2 f(x_k) \,\Delta
\] 
\[
0 = \nabla f(x_k) + \nabla^2 f(x_k) \, \Delta \;\; \Rightarrow \;\; \Delta = -\left(\nabla^2 f(x_k)\right)^{-1} \nabla f(x_k)
\]  which then suggests an iterative update rule of
\[
x_{k+1} = x_{k} - (\nabla^2 f(x_k))^{-1} \, \nabla f(x_k)
\] 
 
 
def newtons_method(x0, f, grad, hess, max_iter=100, max_back=10, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]

  x_i = x0

  for i in range(max_iter):
    g_i = grad(x_i)
    # Newton direction: solve hess(x_i) @ step = -g_i
    step = - np.linalg.solve(hess(x_i), g_i)

    # backtracking: halve the step until the move decreases f
    for j in range(max_back):
      new_x_i = x_i + step
      new_f_i = f(new_x_i)

      if (new_f_i < all_f_i[-1]):
        break

      step /= 2

    x_i, f_i = new_x_i, new_f_i

    all_x_i.append(x_i[0])
    all_y_i.append(x_i[1])
    all_f_i.append(f_i)

    if np.sqrt(np.sum(g_i**2)) < tol:
      break

  return all_x_i, all_y_i, all_f_i
 
 
 
Well conditioned quadratic cost function 
f, grad, hess = mk_quad(0.7)
opt = newtons_method((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1, 2), (-1, 2), f, traj=opt)


f, grad, hess = mk_quad(0.05)
opt = newtons_method((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1, 2), (-1, 2), f, traj=opt)
 
 
 
 
Rosenbrock function (very ill conditioned) 
f, grad, hess = mk_rosenbrock()
opt = newtons_method((1.6, 1.1), f, grad, hess)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)


f, grad, hess = mk_rosenbrock()
opt = newtons_method((-0.5, 0), f, grad, hess)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)
 
 
 
 
Conjugate gradients 
This is a general approach for solving a system of linear equations of the form \(Ax=b\), where \(A\) is an \(n \times n\) symmetric positive definite matrix, \(b\) is \(n \times 1\), and \(x\) is unknown.
This type of problem can also be expressed as a quadratic minimization problem of the form,
\[
\underset{x}{\text{min}} \; f(x) = \frac{1}{2} x^T \, A \, x - b^T x + c
\] 
The goal is then to find \(n\)  conjugate vectors ( \(p^T_i \, A \, p_j = 0\)  for all \(i \neq j\) ) and their coefficients such that
\[ x_* = \sum_{i=1}^n \alpha_i \, p_i \] 
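As a quick sanity check (an illustrative example, not from the slides): the gradient of this quadratic is \(\nabla f(x) = Ax - b\), so its minimizer coincides with the solution of the linear system \(Ax = b\).

# Illustrative check that minimizing the quadratic is equivalent to solving
# Ax = b; the particular A and b here are arbitrary small examples.
import numpy as np
from scipy import optimize

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x_direct = np.linalg.solve(A, b)
x_min = optimize.minimize(f, np.zeros(2), jac=grad, method="CG").x
print(x_direct, x_min)       # the two solutions should agree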
 
Conjugate gradient algorithm 
Given \(x_0\)  we set the following initial values,
\[\begin{align}
r_0 &= \nabla f(x_0) \\
p_0 &= -r_0 \\
k &= 0
\end{align}\] 
while \(\|r_k\|_2 > \text{tol}\) ,
\[
\begin{align}
\alpha_k &= -\frac{r_k^T \, p_k}{p_k^T \, \nabla^2 f(x_k) \, p_k} \\
x_{k+1} &= x_k + \alpha_k \, p_k \\
r_{k+1} &= \nabla f(x_{k+1}) \\
\beta_{k} &= \frac{ r^T_{k+1} \, \nabla^2 f(x_k)  \, p_{k} }{p_k^T \, \nabla^2 f(x_k) \, p_k} \\
p_{k+1} &= -r_{k+1} + \beta_{k} \, p_k \\
k &= k+1
\end{align}
\] 
 
def conjugate_gradient(x0, f, grad, hess,
                       max_iter=100, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]

  x_i = x0
  r_i = grad(x0)
  p_i = -r_i

  for i in range(max_iter):
    # step length along the current conjugate direction
    a_i = - r_i.T @ p_i / (p_i.T @ hess(x_i) @ p_i)
    x_i_new = x_i + a_i * p_i
    r_i_new = grad(x_i_new)
    # update the direction so it remains conjugate to the previous ones
    b_i = (r_i_new.T @ hess(x_i) @ p_i) / (p_i.T @ hess(x_i) @ p_i)
    p_i_new = -r_i_new + b_i * p_i

    x_i, r_i, p_i = x_i_new, r_i_new, p_i_new

    all_x_i.append(x_i[0])
    all_y_i.append(x_i[1])
    all_f_i.append(f(x_i))

    if np.sqrt(np.sum(r_i_new**2)) < tol:
      break

  return all_x_i, all_y_i, all_f_i
 
 
 
 
Trajectory 
f, grad, hess = mk_quad(0.7)
opt = conjugate_gradient((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1, 2), (-1, 2), f, title="$\\epsilon=0.7$", traj=opt)


f, grad, hess = mk_quad(0.05)
opt = conjugate_gradient((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1, 2), (-1, 2), f, title="$\\epsilon=0.05$", traj=opt)
 
 
 
 
Rosenbrock’s function 
f, grad, hess = mk_rosenbrock()
opt = conjugate_gradient((1.6, 1.1), f, grad, hess)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)


f, grad, hess = mk_rosenbrock()
opt = conjugate_gradient((-0.5, 0), f, grad, hess)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)
 
 
 
 
CG in scipy 
Scipy’s optimize module implements the conjugate gradient algorithm by Polak and Ribiere, a variant that does not require the Hessian,
Differences: 
\[
\beta_{k+1}^{PR} = \frac{\nabla f(x_{k+1})^T \left(\nabla f(x_{k+1}) - \nabla f(x_{k})\right)}{\nabla f(x_k)^T \, \nabla f(x_k)}
\] 
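As a standalone illustration (a hypothetical helper, not scipy's internal code), the Polak-Ribiere \(\beta\) can be computed from successive gradients alone:

# Hypothetical helper illustrating the Polak-Ribiere beta update above;
# scipy's internal implementation is not reproduced here.
import numpy as np

def beta_pr(g_new, g_old):
  return (g_new @ (g_new - g_old)) / (g_old @ g_old)

# the direction update is then: p_new = -g_new + beta_pr(g_new, g_old) * p_old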
 
from scipy import optimize

def conjugate_gradient_scipy(x0, f, grad, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]

  # callback invoked by optimize.minimize after each iteration with the current x
  def store(X):
    x, y = X
    all_x_i.append(x)
    all_y_i.append(y)
    all_f_i.append(f(X))

  optimize.minimize(
    f, x0, jac=grad, method="CG",
    callback=store, tol=tol
  )

  return all_x_i, all_y_i, all_f_i
 
 
 
 
Trajectory 
f, grad, hess = mk_quad(0.7)
opt = conjugate_gradient_scipy((1.6, 1.1), f, grad)
plot_2d_traj((-1, 2), (-1, 2), f, title="$\\epsilon=0.7$", traj=opt)


f, grad, hess = mk_quad(0.05)
opt = conjugate_gradient_scipy((1.6, 1.1), f, grad)
plot_2d_traj((-1, 2), (-1, 2), f, title="$\\epsilon=0.05$", traj=opt)
 
 
 
 
Rosenbrock’s function 
f, grad, hess = mk_rosenbrock()
opt = conjugate_gradient_scipy((1.6, 1.1), f, grad)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)


f, grad, hess = mk_rosenbrock()
opt = conjugate_gradient_scipy((-0.5, 0), f, grad)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)
 
 
 
 
Method: Newton-CG 
This is a variant of Newton's method which does not require inverting the Hessian, or even a Hessian function (in the latter case it is estimated by finite differencing of the gradient).
def newton_cg(x0, f, grad, hess=None, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]

  def store(X):
    x, y = X
    all_x_i.append(x)
    all_y_i.append(y)
    all_f_i.append(f(X))

  optimize.minimize(
    f, x0, jac=grad, hess=hess, tol=tol,
    method="Newton-CG", callback=store
  )

  return all_x_i, all_y_i, all_f_i
 
 
 
 
Trajectory - well conditioned 
f, grad, hess = mk_quad(0.7)
opt = newton_cg((1.6, 1.1), f, grad)
plot_2d_traj((-1, 2), (-1, 2), f, traj=opt, title="w/o hessian")


f, grad, hess = mk_quad(0.7)
opt = newton_cg((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1, 2), (-1, 2), f, traj=opt, title="w/ hessian")
 
 
 
 
Trajectory - ill-conditioned 
f, grad, hess = mk_quad(0.05)
opt = newton_cg((1.6, 1.1), f, grad)
plot_2d_traj((-1, 2), (-1, 2), f, traj=opt, title="w/o hessian")


f, grad, hess = mk_quad(0.05)
opt = newton_cg((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1, 2), (-1, 2), f, traj=opt, title="w/ hessian")
 
 
 
 
Rosenbrock’s function 
f, grad, hess = mk_rosenbrock()
opt = newton_cg((1.6, 1.1), f, grad)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt, title="w/o hessian")


f, grad, hess = mk_rosenbrock()
opt = newton_cg((1.6, 1.1), f, grad, hess)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt, title="w/ hessian")
 
 
 
 
Method: BFGS 
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a quasi-Newton method which iteratively improves its approximation of the Hessian,
def bfgs(x0, f, grad, hess=None, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]

  def store(X):
    x, y = X
    all_x_i.append(x)
    all_y_i.append(y)
    all_f_i.append(f(X))

  optimize.minimize(
    f, x0, jac=grad, tol=tol,
    method="BFGS", callback=store
  )

  return all_x_i, all_y_i, all_f_i
 
 
 
 
Trajectory 
f, grad, hess = mk_quad(0.7)
opt = bfgs((1.6, 1.1), f, grad)
plot_2d_traj((-1, 2), (-1, 2), f, traj=opt)


f, grad, hess = mk_quad(0.05)
opt = bfgs((1.6, 1.1), f, grad)
plot_2d_traj((-1, 2), (-1, 2), f, traj=opt)
 
 
 
 
Rosenbrock’s function 
f, grad, hess = mk_rosenbrock()
opt = bfgs((1.6, 1.1), f, grad)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)


f, grad, hess = mk_rosenbrock()
opt = bfgs((-0.5, 0), f, grad)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)
 
 
 
 
Method: Nelder-Mead 
This is a gradient-free method that uses a sequence of simplexes to iteratively bracket the minimum.
def nelder_mead(x0, f, grad, hess=None, tol=1e-8):
  # grad and hess are accepted to match the other wrappers but are not used
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]

  def store(X):
    x, y = X
    all_x_i.append(x)
    all_y_i.append(y)
    all_f_i.append(f(X))

  optimize.minimize(
    f, x0, tol=tol,
    method="Nelder-Mead", callback=store
  )

  return all_x_i, all_y_i, all_f_i
 
 
 
 
Trajectory 
f, grad, hess = mk_quad(0.7)
opt = nelder_mead((1.6, 1.1), f, grad)
plot_2d_traj((-1, 2), (-1, 2), f, traj=opt)


f, grad, hess = mk_quad(0.05)
opt = nelder_mead((1.6, 1.1), f, grad)
plot_2d_traj((-1, 2), (-1, 2), f, traj=opt)
 
 
 
 
Rosenbrock’s function 
f, grad, hess = mk_rosenbrock()
opt = nelder_mead((1.6, 1.1), f, grad)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)


f, grad, hess = mk_rosenbrock()
opt = nelder_mead((-0.5, 0), f, grad)
plot_2d_traj((-2, 2), (-2, 2), f, traj=opt)