| '''Limited-memory BFGS''' ('''L-BFGS''' or '''LM-BFGS''') is an [[optimization (mathematics)|optimization]] [[algorithm]] in the family of [[quasi-Newton method]]s that approximates the [[BFGS method|Broyden–Fletcher–Goldfarb–Shanno (BFGS)]] algorithm using a limited amount of [[computer memory]]. It is a popular algorithm for parameter estimation in [[machine learning]].<ref>{{cite conference|first=Robert |last=Malouf |year=2002 |title=A comparison of algorithms for maximum entropy parameter estimation |conference=Proc. Sixth Conf. on Natural Language Learning (CoNLL) |pages=49–55 |url=http://acl.ldc.upenn.edu/W/W02/W02-2018.pdf}}</ref><ref name="owlqn"/>
| | |
| Like the original BFGS, L-BFGS uses an approximation to the inverse [[Hessian matrix]] to steer its search through variable space, but where BFGS stores a dense ''n''×''n'' approximation to the Hessian (''n'' being the number of variables in the problem), L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its resulting linear memory requirement, the L-BFGS method is particularly well suited for optimization problems with a large number of variables. Instead of the inverse Hessian '''H'''''<sub>k</sub>'', L-BFGS maintains a history of the past ''m'' updates of the position '''x''' and gradient ∇''f''('''x'''), where generally the history size ''m'' can be small (often ''m''<10). These updates are used to implicitly do operations requiring the '''H'''''<sub>k</sub>''-vector product.
| |
| | |
| ==Algorithm==
| |
| L-BFGS shares many features with other quasi-Newton algorithms, but differs in how it carries out the matrix-vector multiplication <math>d_k=-H_k g_k\,\!</math> that yields the search direction. There are multiple published approaches to using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two loop recursion."<ref>{{Cite journal|doi=10.1002/nme.1620141104|first1=H.|last1= Matthies |first2= G.|last2= Strang|title=The solution of non linear finite element equations |year=1979|journal= International Journal for Numerical Methods in Engineering |volume=14|pages=1613–1626|issue=11}}</ref><ref>{{cite journal|doi=10.1090/S0025-5718-1980-0572855-7|first=J. |last=Nocedal|title= Updating Quasi-Newton Matrices with Limited Storage |year=1980|journal=Mathematics of Computation |volume=35|pages=773–782|issue=151}}</ref>
| |
| | |
| We'll take as given <math>x_k\,\!</math>, the position at the <math>k\,\!</math>-th iteration, and <math>g_k\equiv\nabla f(x_k)</math> where <math>f\,\!</math> is the function being minimized, and all vectors are column vectors. Then we keep the updates <math>s_k = x_{k+1} - x_k\,\!</math> and <math>y_k = g_{k+1} - g_k\,\!</math>. We'll define <math>\rho_k = \frac{1}{y^{\rm T}_k s_k} </math>, and <math>H^0_k\,\!</math> will be the 'initial' approximation of the inverse Hessian that our estimate at iteration <math>k\,\!</math> begins with.
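
For context, the standard BFGS update of the inverse Hessian approximation can be written in the notation above as follows. This is the textbook form of the update, added here as a reminder rather than taken from the cited sources:

:<math>H_{k+1} = \left(I - \rho_k s_k y^{\rm T}_k\right) H_k \left(I - \rho_k y_k s^{\rm T}_k\right) + \rho_k s_k s^{\rm T}_k</math>

L-BFGS applies this update only for the ''m'' most recent pairs <math>(s_i, y_i)</math>, starting from <math>H^0_k\,\!</math>, and never forms the resulting matrix explicitly.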
| |
| Then we can compute the (uphill) direction as follows:
| |
| | |
| :<math>q = g_k\,\!</math>
| |
| :For <math>i=k-1, k-2, \ldots, k-m</math>
| |
| ::<math>\alpha_i = \rho_i s^{\rm T}_i q\,\!</math>
| |
| ::<math>q = q - \alpha_i y_i\,\!</math>
| |
| :<math>H^0_k=\frac{y^{\rm T}_{k-1} s_{k-1}}{y^{\rm T}_{k-1} y_{k-1}} I</math>
| |
| :<math>z = H^0_k q</math>
| |
| :For <math>i=k-m, k-m+1, \ldots, k-1</math>
| |
| :: <math>\beta_i = \rho_i y^{\rm T}_i z\,\!</math>
| |
| :: <math>z = z + s_i (\alpha_i - \beta_i)\,\!</math>
| |
| :Stop with <math>H_k g_k = z\,\!</math>
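
The recursion above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function and variable names (<code>two_loop_direction</code>, <code>s_hist</code>, <code>y_hist</code>) are hypothetical, the history lists are assumed to be ordered from oldest to newest, and the initial inverse Hessian is assumed to be a scaled identity built from the newest pair.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def two_loop_direction(grad, s_hist, y_hist):
    """Return z, an approximation of H_k * grad, from the stored pairs.

    s_hist and y_hist hold at most m pairs, ordered oldest to newest.
    Every pair is assumed to satisfy y . s != 0.
    """
    q = list(grad)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_hist, y_hist)]
    alphas = [0.0] * len(s_hist)

    # First loop: walk from the newest pair back to the oldest.
    for i in reversed(range(len(s_hist))):
        alphas[i] = rhos[i] * dot(s_hist[i], q)
        q = [qj - alphas[i] * yj for qj, yj in zip(q, y_hist[i])]

    # Initial inverse Hessian: identity scaled by (y.s)/(y.y) of the
    # newest pair, or the plain identity when no history exists yet.
    if s_hist:
        gamma = dot(y_hist[-1], s_hist[-1]) / dot(y_hist[-1], y_hist[-1])
    else:
        gamma = 1.0
    z = [gamma * qj for qj in q]

    # Second loop: walk from the oldest pair forward to the newest.
    for i in range(len(s_hist)):
        beta = rhos[i] * dot(y_hist[i], z)
        z = [zj + sj * (alphas[i] - beta) for zj, sj in zip(z, s_hist[i])]

    return z  # "uphill"; negate it to obtain a descent direction
```

One property that is easy to check is the secant condition for the newest pair: passing the newest <code>y</code> as the gradient should return the newest <code>s</code>, up to rounding.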
| |
| | |
| This formulation is valid whether we are minimizing or maximizing. Note that if we are minimizing, the search direction would be the negative of z (since z is "uphill"), and if we are maximizing, <math>H^{0}_k</math> should be negative definite rather than positive definite. We would typically do a [[backtracking line search]] in the search direction (any [[line search]] would be valid, but L-BFGS does not require exact line searches in order to converge).
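
A backtracking line search of the kind mentioned above might look like the following sketch. It assumes minimization and uses the Armijo sufficient-decrease test; the function name, parameter names and default values are illustrative choices, not part of any particular library.

```python
def backtracking_line_search(f, x, fx, grad, direction,
                             step=1.0, shrink=0.5, c1=1e-4, max_iter=50):
    """Shrink the step until the Armijo condition holds.

    f         -- objective, called with a list of floats
    x, fx     -- current point and its objective value
    grad      -- gradient at x
    direction -- descent direction (for example, minus the two-loop result)
    Returns (accepted_step, new_point).
    """
    slope = sum(g * d for g, d in zip(grad, direction))
    if slope >= 0:
        raise ValueError("direction is not a descent direction")
    for _ in range(max_iter):
        trial = [xi + step * di for xi, di in zip(x, direction)]
        # Accept once the decrease is at least a fraction c1 of the
        # decrease predicted by the linear model.
        if f(trial) <= fx + c1 * step * slope:
            return step, trial
        step *= shrink
    raise RuntimeError("no acceptable step found")
```

The Armijo test alone does not guarantee <math>y^{\rm T}_k s_k > 0</math>, so a driver built on this sketch would need to check that quantity before storing a new pair.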
| |
| | |
| Commonly, the initial inverse Hessian approximation <math>H^0_k\,\!</math> is represented as a diagonal matrix, so that initially setting <math>z\,\!</math> requires only an element-by-element multiplication.
| |
| | |
| This two loop update only works for the inverse Hessian. Approaches to implementing L-BFGS using the direct approximate Hessian <math>B_k\,\!</math> have also been developed, as have other means of approximating the inverse Hessian.<ref>{{cite journal|doi=10.1007/BF01582063|last1=Byrd|first1= R. H.|last2= Nocedal|first2=J.|last3= Schnabel|first3= R. B.|title=Representations of Quasi-Newton Matrices and their use in Limited Memory Methods|year=1994|journal= Mathematical Programming|volume=63|issue= 4|pages=129–156}}</ref>
| |
| | |
| ==Applications==
| |
| L-BFGS has been called "the algorithm of choice" for fitting [[Multinomial logit|log-linear (MaxEnt) models]] and [[conditional random field]]s with [[Regularization (mathematics)|<math>\ell_2</math>-regularization]].<ref name="owlqn">{{cite doi|10.1145/1273496.1273501}}</ref>
| |
| | |
| ==Variants==
| |
| Since BFGS (and hence L-BFGS) is designed to minimize [[smooth function|smooth]] functions without [[Constraint (mathematics)|constraints]], the L-BFGS algorithm must be modified to handle functions that include non-[[differentiable]] components or constraints. A popular class of modifications is called active-set methods, based on the concept of the [[active set]]. The idea is that when restricted to a small neighborhood of the current iterate, the function and constraints can be simplified.
| |
| | |
| ===L-BFGS-B===
| |
| The '''L-BFGS-B''' algorithm extends L-BFGS to handle simple box constraints (aka bound constraints) on variables; that is, constraints of the form <math>l_i \leq x_i \leq u_i</math> where ''l<sub>i</sub>'' and ''u<sub>i</sub>'' are per-variable constant lower and upper bounds, respectively (for each ''x<sub>i</sub>'', either or both bounds may be omitted).<ref name="LBFGSB1">{{Cite doi|10.1137/0916069}}</ref><ref name="algo778">{{cite journal|last1=Zhu|doi=10.1145/279232.279236|first1=C.|last2=Byrd|first2=Richard H.|last3=Lu|first3=Peihuang|last4=Nocedal|first4=Jorge |title=L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization |year=1997|journal= ACM Transactions on Mathematical Software|volume= 23|issue= 4|pages= 550–560}}</ref> The method works by identifying fixed and free variables at every step (using a simple gradient method), then applying the L-BFGS method to the free variables only to get higher accuracy, and repeating the process.
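
The split into fixed and free variables can be illustrated with a simplified sketch. It is not the procedure used by the published L-BFGS-B code, which computes a generalized Cauchy point along the projected gradient path. Here a variable is simply treated as fixed when it sits on a bound and the descent direction would push it outside the box. The helper names are hypothetical, and omitted bounds are represented by infinities.

```python
INF = float("inf")


def project_to_box(x, lower, upper):
    """Clip each coordinate of x into [lower_i, upper_i]."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def free_variables(x, grad, lower, upper):
    """Return indices of variables that are not pinned at a bound.

    For minimization the descent direction is -grad, so a variable at
    its lower bound with a positive gradient component (or at its upper
    bound with a negative one) cannot move and is treated as fixed.
    """
    free = []
    for i, (xi, gi, lo, hi) in enumerate(zip(x, grad, lower, upper)):
        pinned_low = xi <= lo and gi > 0
        pinned_high = xi >= hi and gi < 0
        if not (pinned_low or pinned_high):
            free.append(i)
    return free
```

An L-BFGS step would then be taken over the free indices only, with the result projected back into the box.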
| |
| | |
| ===OWL-QN===
| |
| '''Orthant-wise limited-memory quasi-Newton''' ('''OWL-QN''') is an L-BFGS variant for fitting [[Taxicab geometry|<math>\ell_1</math>]]-[[Regularization (mathematics)|regularized]] models, exploiting the inherent [[Sparse matrix|sparsity]] of such models.<ref name="owlqn"/>
| |
| It minimizes functions of the form
| |
| | |
| :<math>f(\vec x) = g(\vec x) + C \|\vec x\|_1</math>
| |
| | |
| where <math>g</math> is a [[Differentiable function|differentiable]] [[Convex function|convex]] [[loss function]]. The method is an active-set type method: at each iterate, it estimates the [[Sign (mathematics)|sign]] of each component of the variable, and restricts the subsequent step to have the same sign. Once the sign is fixed, the non-differentiable <math> \|\vec x\|_1</math> term becomes a smooth linear term which can be handled by L-BFGS. After an L-BFGS step, the method allows some variables to change sign, and repeats the process.
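
The sign handling described above can be sketched as follows. The sketch covers three ingredients: a pseudo-gradient of <math>f</math> that picks a one-sided derivative at zero components, the choice of orthant (sign pattern), and the projection of a trial point back onto that orthant. The function names are hypothetical, and the code is an illustration of the idea rather than the designers' implementation.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pseudo_gradient(x, grad_g, C):
    """Pseudo-gradient of g(x) + C * ||x||_1, given the gradient of g."""
    pg = []
    for xi, gi in zip(x, grad_g):
        if xi > 0:
            pg.append(gi + C)
        elif xi < 0:
            pg.append(gi - C)
        elif gi + C < 0:   # moving x_i above zero decreases f
            pg.append(gi + C)
        elif gi - C > 0:   # moving x_i below zero decreases f
            pg.append(gi - C)
        else:              # zero is locally optimal for this coordinate
            pg.append(0.0)
    return pg


def choose_orthant(x, pg):
    """Sign of x_i where nonzero, otherwise the sign of the descent step."""
    return [sign(xi) if xi != 0 else sign(-pgi) for xi, pgi in zip(x, pg)]


def project_onto_orthant(x_new, orthant):
    """Zero out every coordinate whose sign left the chosen orthant."""
    return [xi if sign(xi) == oi else 0.0 for xi, oi in zip(x_new, orthant)]
```

Coordinates that are zeroed by the projection stay exactly zero, which is how sparsity of the solution is preserved from step to step.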
| |
| | |
| ===O-LBFGS===
| |
| Schraudolph ''et al.'' present an [[online machine learning|online]] approximation to both BFGS and L-BFGS.<ref>{{cite conference |title=A stochastic quasi-Newton method for online convex optimization |authors=N. Schraudolph, J. Yu, and S. Günter |conference=AISTATS |year=2007}}</ref>
| |
| | |
| ==Implementations==
| |
| An early, open source implementation of L-BFGS in Fortran exists in [[Netlib]] as a [[shar]] archive [http://netlib.org/opt/lbfgs_um.shar]. Multiple other open source implementations have been produced as translations of this Fortran code (e.g. [http://riso.sourceforge.net/ Java], and [http://www.scipy.org/doc/api_docs/SciPy.optimize.lbfgsb.html#fmin_l_bfgs_b Python] via [[SciPy]]). Other implementations exist (e.g. [http://www.mathworks.com/help/toolbox/optim/ug/fmincon.html Matlab (optimization toolbox)], [http://www.mathworks.com/matlabcentral/fileexchange/23245 Matlab (BSD)]), frequently as part of generic optimization libraries (e.g. [http://reference.wolfram.com/mathematica/tutorial/UnconstrainedOptimizationQuasiNewtonMethods.html Mathematica], [http://funclib.codeplex.com/ FuncLib C# library], and [http://dlib.net/optimization.html dlib C++ library]). [http://www.chokkan.org/software/liblbfgs/ libLBFGS] is a [[C (programming language)|C]] implementation.
| |
| | |
| === Implementations of variants ===
| |
| The L-BFGS-B variant also exists as [http://toms.acm.org/ ACM TOMS] algorithm 778.<ref name="algo778"/> In February 2011, some of the authors of the original L-BFGS-B code posted a major update (version 3.0).
| |
| | |
| A reference implementation<ref name="LBFGSB_update">{{Cite doi|10.1145/2049662.2049669}}</ref> is available in [[Fortran_77#FORTRAN_77|Fortran 77]] (and with a [[Fortran#Fortran_90|Fortran 90]] interface) at the [http://users.eecs.northwestern.edu/~nocedal/lbfgsb.html author's website]. This version, as well as older versions, has been converted to many other languages, including a [http://www.mini.pw.edu.pl/~mkobos/programs/lbfgsb_wrapper/index.html Java wrapper] for v3.0; [[Matlab]] interfaces for [http://www.mathworks.com/matlabcentral/fileexchange/35104-lbfgsb-l-bfgs-b-mex-wrapper v3.0], [http://www.cs.toronto.edu/~liam/software.shtml v2.4], and [http://www.cs.ubc.ca/~pcarbo/lbfgsb-for-matlab.html v2.1]; a [http://code.google.com/p/otkpp/source/browse/trunk/otkpp/localsolvers/lbfgsb/LBFGSB.cpp?r=51 C++ interface] for v2.1;
| |
| a Python interface for v3.0 as part of [http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html scipy.optimize.minimize]; an [http://forge.ocamlcore.org/projects/lbfgs/ OCaml interface] for v2.1 and v3.0; version 2.3 has been converted to [[C (programming language)|C]] by [[f2c]] and is available at this [http://www.koders.com/c/fid4A53890DFB42BB9734639793C7BDD4EB1B8E6583.aspx?s=decomposition website]; and [[R (programming language)|R's]] <code>optim</code> general-purpose optimizer routine includes L-BFGS-B by using <code>method="L-BFGS-B"</code>.<ref name = "R-optim">{{cite web
| |
| | title = General-purpose Optimization
| |
| | url = http://finzi.psych.upenn.edu/R/library/stats/html/optim.html
| |
| | publisher = [[R (programming language)#CRAN|Comprehensive R Archive Network]]
| |
| | work = R documentation
| |
| }}
| |
| </ref>
| |
| | |
| OWL-QN implementations are available in:
| |
| * [http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/ C++ implementation by its designers], includes the original ICML paper on the algorithm<ref name="owlqn"/>
| |
| * [http://www.umiacs.umd.edu/~msubotin/owlqn.py Python implementation] by Michael Subotin, intended for use with [[SciPy]]
| |
| * The [[Conditional random field|CRF]] toolkit [http://wapiti.limsi.fr Wapiti] includes a C implementation
| |
| | |
| ==Works cited==
| |
| {{reflist|2}}
| |
| | |
| ==Further reading==
| |
| *{{cite journal|doi= 10.1007/BF01589116|first=D. C.|last1= Liu |first2= J.|last2= Nocedal|url=http://www.ece.northwestern.edu/~nocedal/PSfiles/limited-memory.ps.gz|title= On the Limited Memory Method for Large Scale Optimization|year=1989|journal= Mathematical Programming B|volume=45|issue= 3|pages= 503–528}}
| |
| *{{cite journal|doi=10.1137/0916069|last1=Byrd|url=http://www.ece.northwestern.edu/~nocedal/PSfiles/limited.ps.gz|first1=Richard H.|last2=Lu|first2=Peihuang|last3=Nocedal|first3=Jorge|last4=Zhu|first4=Ciyou|title=A Limited Memory Algorithm for Bound Constrained Optimization|year=1995|journal= SIAM Journal on Scientific and Statistical Computing|pages=1190–1208|volume=16|issue= 5}}
| |
| | |
| {{Optimization algorithms|unconstrained}}
| |
| | |
| [[Category:Optimization algorithms and methods]]
| |