| '''Stochastic gradient descent''' is a [[gradient descent optimization]] [[iterative method|method]] for minimizing an [[objective function]] that is written as a sum of differentiable functions.
| | |
| == Background ==
| |
| {{Main|M-estimation}}
| |
| {{See also|Estimating equation}}
| |
| Both [[statistics|statistical]] [[M-estimation|estimation]] and [[machine learning]] consider the problem of minimizing an [[objective function]] that has the form of a sum:
| |
| : <math>Q(w) = \sum_{i=1}^n Q_i(w),</math>
| |
| where the [[parametric statistics|parameter]] <math>w</math> is to be [[estimator|estimated]] and where typically each summand function <math>Q_i( )</math> is associated with the <math>i</math>-th [[observation]] in the [[data set]] (used for training).
| |
| | |
| In classical statistics, sum-minimization problems arise in [[least squares]] and in [[maximum-likelihood estimation]] (for independent observations). The general class of estimators that arise as minimizers of sums are called [[M-estimator]]s. However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation, as shown for example by Thomas Ferguson's example.<ref>{{cite journal | last = Ferguson | first = Thomas S. | title = An inconsistent maximum likelihood estimate | journal = Journal of the American Statistical Association | volume = 77 | issue = 380 | year = 1982 | pages = 831–834 | jstor = 2287314 | doi = 10.1080/01621459.1982.10477894 }}</ref> Therefore, contemporary statistical theorists often consider [[stationary point]]s of the [[likelihood function]] (or zeros of its derivative, the [[Score (statistics)|score function]], and other [[estimating equations]]).
| |
| | |
| The sum-minimization problem also arises in [[empirical risk minimization]]. In this case, <math>Q_i(w)</math> is the value of the [[loss function]] at the <math>i</math>-th example, and <math>Q(w)</math> is the empirical risk.
| |
| | |
| When used to minimize the above function, a standard (or "batch") [[gradient descent]] method would perform the following iterations:
| |
| : <math>w := w - \alpha \nabla Q(w) = w - \alpha \sum_{i=1}^n \nabla Q_i(w),</math>
| |
| where <math>\alpha</math> is a step size (sometimes called the ''learning rate'' in machine learning).
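The batch iteration above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name <code>batch_gradient_descent</code>, the fixed iteration count, and the toy summands <math>Q_i(w) = (w - c_i)^2</math> are all hypothetical choices made for the example.

```python
def batch_gradient_descent(grad_fns, w, alpha, steps):
    """Minimize Q(w) = sum_i Q_i(w) by full-gradient steps.

    grad_fns: list of functions; grad_fns[i](w) returns the gradient of Q_i at w
    w:        initial parameter vector (list of floats)
    alpha:    step size (learning rate)
    steps:    number of iterations (a fixed count, assumed here for simplicity)
    """
    for _ in range(steps):
        # Sum the gradients of all n summands; this is the costly part.
        total = [0.0] * len(w)
        for grad in grad_fns:
            total = [t + gj for t, gj in zip(total, grad(w))]
        # w := w - alpha * sum_i grad Q_i(w)
        w = [wj - alpha * tj for wj, tj in zip(w, total)]
    return w


# Hypothetical summands Q_i(w) = (w[0] - c_i)^2, whose sum is minimized
# at the mean of the c_i (here 3.0).
targets = [1.0, 2.0, 6.0]
grad_fns = [lambda w, c=c: [2.0 * (w[0] - c)] for c in targets]
w_hat = batch_gradient_descent(grad_fns, [0.0], alpha=0.1, steps=200)
```

Every iteration touches all <math>n</math> gradient functions, which is the per-step cost that the stochastic variant is designed to avoid.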
| |
| | |
| In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, [[exponential families|one-parameter exponential families]] allow economical function-evaluations and gradient-evaluations.
| |
| | |
| However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients of all summand functions. When the training set is enormous and no simple formulas exist, each iteration becomes very expensive, because it requires evaluating the gradient of every summand function. To economize on the computational cost at every iteration, stochastic gradient descent [[sampling (statistics)|samples]] a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.<ref>{{Cite web|first1=Léon|last1=Bottou|author1-link=Léon Bottou|last2=Bousquet|first2=Olivier|author2-link=Olivier Bousquet|contribution=The Tradeoffs of Large Scale Learning|url=http://leon.bottou.org/papers/bottou-bousquet-2008|title=Advances in Neural Information Processing Systems|volume=20|pages=161–168|year=2008|postscript=<!-- Bot inserted parameter. Either remove it; or change its value to "." for the cite to end in a ".", as necessary. -->{{inconsistent citations}}}}</ref>
| |
| | |
| == Iterative method ==
| |
| [[Image:stogra.png|thumb|right|Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.]]
| |
| | |
| In stochastic (or "on-line") gradient descent, the true gradient of <math>Q(w)</math> is approximated by a gradient at a single example:
| |
| : <math>w := w - \alpha \nabla Q_i(w).</math>
| |
| As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes over the training set are made until the algorithm converges. Typical implementations may also randomly shuffle training examples at each pass and use an adaptive learning rate.
| |
| | |
| In pseudocode, stochastic gradient descent with shuffling of the training set at each pass can be presented as follows:
| |
| <div style="margin-left: 35px; width: 600px">
| |
| {{framebox|blue}}
| |
| * Choose an initial vector of parameters <math>w</math> and learning rate <math>\alpha</math>.
| |
| * Repeat until an approximate minimum is obtained:
| |
| ** Randomly shuffle examples in the training set.
| |
| ** For <math>\! i=1, 2, ..., n</math>, do:
| |
| *** <math>\! w := w - \alpha \nabla Q_i(w).</math>
| |
| {{frame-footer}}
| |
| </div>
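The pseudocode above might be written in Python roughly as follows. This is a sketch under assumptions: the name <code>sgd</code>, the fixed number of passes used in place of a convergence test, the constant learning rate, and the toy summands are hypothetical and not taken from any particular library.

```python
import random


def sgd(grad_fns, w, alpha, passes, seed=0):
    """Stochastic gradient descent with reshuffling at every pass.

    grad_fns: list of functions; grad_fns[i](w) returns the gradient of Q_i at w
    w:        initial parameter vector (list of floats)
    alpha:    constant learning rate (an assumption; adaptive rates are common)
    passes:   number of sweeps over the training set (stands in for a
              convergence test, which is omitted in this sketch)
    """
    rng = random.Random(seed)
    order = list(range(len(grad_fns)))
    for _ in range(passes):
        rng.shuffle(order)            # randomly shuffle the examples
        for i in order:               # for i = 1, ..., n (in shuffled order)
            g = grad_fns[i](w)
            # w := w - alpha * grad Q_i(w)
            w = [wj - alpha * gj for wj, gj in zip(w, g)]
    return w


# Hypothetical summands Q_i(w) = (w[0] - c_i)^2; the sum is minimized at 3.0.
targets = [1.0, 2.0, 6.0]
grad_fns = [lambda w, c=c: [2.0 * (w[0] - c)] for c in targets]
w_hat = sgd(grad_fns, [0.0], alpha=0.01, passes=500)
```

With a constant learning rate the iterate does not settle exactly on the minimizer but keeps fluctuating in a small neighbourhood of it, so the result here is only close to 3.0.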
| |
| | |
| A compromise between batch gradient descent and stochastic gradient descent, often called "mini-batch" gradient descent, approximates the true gradient by a sum over a small number of training examples.
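A mini-batch update of this kind could look as follows in Python. This is an illustrative sketch only: the name <code>minibatch_sgd</code>, the <code>batch_size</code> parameter, the fixed number of passes, and the toy summands are assumptions introduced for the example.

```python
import random


def minibatch_sgd(grad_fns, w, alpha, batch_size, passes, seed=0):
    """Mini-batch gradient descent: each step sums a few summand gradients.

    batch_size = 1 gives the single-example update, and
    batch_size = len(grad_fns) gives the batch update.
    """
    rng = random.Random(seed)
    order = list(range(len(grad_fns)))
    for _ in range(passes):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Approximate the true gradient by a sum over the mini-batch.
            total = [0.0] * len(w)
            for i in batch:
                total = [t + gj for t, gj in zip(total, grad_fns[i](w))]
            w = [wj - alpha * tj for wj, tj in zip(w, total)]
    return w


# Hypothetical summands Q_i(w) = (w[0] - c_i)^2; the sum is minimized at 3.0.
targets = [1.0, 2.0, 6.0, 3.0]
grad_fns = [lambda w, c=c: [2.0 * (w[0] - c)] for c in targets]
w_hat = minibatch_sgd(grad_fns, [0.0], alpha=0.01, batch_size=2, passes=500)
```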
| |
| | |
| The convergence of stochastic gradient descent has been analyzed using the theories of [[convex optimization|convex minimization]] and of [[stochastic approximation]]. Briefly, when the learning rates <math>\alpha</math> decrease with an appropriate rate,
| |
| and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum
| |
| when the objective function is [[convex function|convex]] or [[pseudoconvex function|pseudoconvex]],
| |
| and otherwise converges almost surely to a local minimum.<ref>{{Cite book
| |
| |last=Bottou
| |
| |first=Léon
| |
| |authorlink=Léon Bottou
| |
| |contribution=Online Algorithms and Stochastic Approximations
| |
| |year=1998
| |
| |title=Online Learning and Neural Networks
| |
| |publisher=Cambridge University Press
| |
| |url=http://leon.bottou.org/papers/bottou-98x
| |
| |isbn=978-0-521-65263-6
| |
| |postscript=<!-- Bot inserted parameter. Either remove it; or change its value to "." for the cite to end in a ".", as necessary. -->{{inconsistent citations}}
| |
| }}</ref>
| |
| <ref>{{cite article
| |
| |last=Kiwiel
| |
| |first=Krzysztof C.
| |
| |title=Convergence and efficiency of subgradient methods for quasiconvex minimization
| |
| |journal=Mathematical Programming (Series A)
| |
| |publisher=Springer|location=Berlin, Heidelberg
| |
| |issn=0025-5610|pages=1–25|volume=90|issue=1
| |
| |doi=10.1007/PL00011414|year=2001 |mr=1819784}}</ref>
| |
| This is in fact a consequence of the Robbins–Siegmund theorem.<ref>{{Cite book
| |
| |last1=Robbins
| |
| |first1=Herbert
| |
| |author1-link=Herbert Robbins
| |
| |last2=Siegmund
| |
| |first2=David O.
| |
| |author2-link=David O. Siegmund
| |
| |contribution=A convergence theorem for non negative almost supermartingales and some applications
| |
| |title=Optimizing Methods in Statistics
| |
| |publisher=Academic Press
| |
| |year=1971
| |
| |editor-last=Rustagi
| |
| |editor-first=Jagdish S.
| |
| |postscript=<!-- Bot inserted parameter. Either remove it; or change its value to "." for the cite to end in a ".", as necessary. -->{{inconsistent citations}}
| |
| }}
| |
| </ref>
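One commonly used way to make "decrease with an appropriate rate" precise is the pair of step-size conditions from stochastic approximation. It is given here as an assumption, since the passage above does not state the conditions and the cited sources should be consulted for the exact hypotheses. Writing <math>\alpha_t</math> for the learning rate used at iteration <math>t</math> (a notation introduced only for this remark):

: <math>\sum_{t=1}^\infty \alpha_t = \infty, \qquad \sum_{t=1}^\infty \alpha_t^2 < \infty.</math>

For instance, <math>\alpha_t = \alpha_0 / t</math> with <math>\alpha_0 > 0</math> satisfies both conditions, whereas a constant learning rate violates the second.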
| |
| | |
| == Example ==
| |
| Suppose we want to fit a straight line <math>y = \! w_1 + w_2 x</math> to a training set of two-dimensional points <math>\! (x_1, y_1), \ldots, (x_n, y_n)</math> using [[least squares]]. The objective function to be minimized is:
| |
| : <math>Q(w) = \sum_{i=1}^n Q_i(w) = \sum_{i=1}^n \left(w_1 + w_2 x_i - y_i\right)^2.</math>
| |
| | |
| For this specific problem, the last line of the above pseudocode becomes:
| |
| : <math>\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} :=
| |
| \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}
| |
| - \alpha \begin{bmatrix} 2(w_1 + w_2 x_i - y_i) \\ 2x_i(w_1 + w_2 x_i - y_i) \end{bmatrix}.</math>
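The update above can be tried out with a short Python sketch. The function name <code>fit_line_sgd</code>, the default learning rate, the number of passes, and the noise-free sample points are hypothetical choices made for illustration.

```python
import random


def fit_line_sgd(points, alpha=0.01, passes=2000, seed=0):
    """Fit y = w1 + w2 * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    pts = list(points)
    w1, w2 = 0.0, 0.0
    for _ in range(passes):
        rng.shuffle(pts)
        for x, y in pts:
            r = w1 + w2 * x - y  # residual of the i-th example
            # Gradient of Q_i is (2r, 2 x r); both weights are updated
            # simultaneously from the same residual.
            w1, w2 = w1 - alpha * 2.0 * r, w2 - alpha * 2.0 * x * r
    return w1, w2


# Hypothetical noise-free data lying exactly on y = 1 + 2x.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w1, w2 = fit_line_sgd(points)
```

Because these sample points lie exactly on a line, every residual vanishes at the minimizer and the iterates approach <math>w_1 = 1</math>, <math>w_2 = 2</math> even with a constant learning rate. With noisy data the iterates would instead fluctuate around the least-squares solution unless the learning rate is decreased.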
| |
| | |
| == Applications ==
| |
| Stochastic gradient descent is a popular algorithm for training a wide range of models in [[machine learning]], including (linear) [[support vector machine]]s, [[logistic regression]] (see, e.g., [[Vowpal Wabbit]]) and [[graphical model]]s.<ref>Jenny Rose Finkel, Alex Kleeman, Christopher D. Manning (2008). Efficient, Feature-based, Conditional Random Field Parsing. Proc. Annual Meeting of the ACL.</ref>
| |
| It competes with the [[limited-memory BFGS|L-BFGS]] algorithm, which is also widely used. SGD has been used since at least 1960 for training [[linear regression]] models, originally under the name [[ADALINE]].<ref>{{cite web |author=Avi Pfeffer |title=CS181 Lecture 5 — Perceptrons |url=http://www.seas.harvard.edu/courses/cs181/files/lecture05-notes.pdf |publisher=Harvard University}}</ref>
| |
| | |
| When combined with the [[backpropagation]] algorithm, it is the ''de facto'' standard algorithm for training (shallow) [[artificial neural network]]s.
| |
| | |
| Another popular stochastic gradient descent algorithm is the [[Least mean squares filter|least mean squares (LMS)]] adaptive filter.
| |
| | |
| == References ==
| |
| <references/>
| |
| | |
| * {{cite book
| |
| | last = Bertsekas
| |
| | first = Dimitri
| |
| | authorlink = Dimitri P. Bertsekas
| |
| | title = Convex Analysis and Optimization
| |
| | publisher = Athena Scientific
| |
| | year = 2003
| |
| }}
| |
| | |
| * {{cite book
| |
| | last = Bertsekas
| |
| | first = Dimitri P.
| |
| | authorlink = Dimitri P. Bertsekas
| |
| | title = Nonlinear Programming
| |
| | publisher = Athena Scientific
| |
| | year = 1999|edition=Second
| |
| | location = Cambridge, MA.
| |
| | isbn = 1-886529-00-0
| |
| }}
| |
| | |
| * {{Cite book
| |
| |last=Bottou
| |
| |first=Léon
| |
| |authorlink=Léon Bottou
| |
| |contribution=Stochastic Learning
| |
| |year=2004
| |
| |title=Advanced Lectures on Machine Learning
| |
| |pages=146–168
| |
| |publisher=LNAI 3176, Springer Verlag
| |
| |url=http://leon.bottou.org/papers/bottou-mlss-2004
| |
| |isbn=978-3-540-23122-6
| |
| |postscript=<!-- Bot inserted parameter. Either remove it; or change its value to "." for the cite to end in a ".", as necessary. -->{{inconsistent citations}}
| |
| }}
| |
| | |
| * {{cite article|last=Davidon|first=W. C.|authorlink=William C. Davidon|title=New least-square algorithms|journal=Journal of Optimization Theory and Applications|volume=18|year=1976|number=2|pages=187–197|doi=10.1007/BF00935703|mr=418461}}
| |
| | |
| * {{cite article|last=Kiwiel|first=Krzysztof C.|title=Convergence of approximate and incremental subgradient methods for convex optimization|journal=SIAM Journal on Optimization|volume=14|year=2003|number=3|pages=807–840|doi=10.1137/S1052623400376366|mr=2085944}} (Extensive list of references)
| |
| | |
| * ''Pattern Classification'' by Richard O. Duda, Peter E. Hart, David G. Stork, ISBN 0-471-05669-3, 2000
| |
| | |
| * ''Introduction to Stochastic Search and Optimization'' by James C. Spall, ISBN 0-471-33052-3, 2003
| |
| | |
| == Software ==
| |
| * [http://leon.bottou.org/projects/sgd sgd]: an LGPL C++ library which uses stochastic gradient descent to fit [[Support vector machine|SVM]] and [[conditional random field]] models.
| |
| * [http://klcl.pku.edu.cn/member/sunxu/code.htm CRF-ADF] A [[C Sharp (programming language)|C#]] toolkit of stochastic gradient descent and its feature-frequency-adaptive variation for training [[conditional random field]] models.
| |
| | |
| == External links ==
| |
| * [http://codingplayground.blogspot.it/2013/05/stocastic-gradient-descent.html Using stochastic gradient descent in C++, Boost, Ublas for linear regression]
| |
| | |
| [[Category:Stochastic optimization]]
| |
| [[Category:Computational statistics]]
| |
| [[Category:M-estimators]]
| |
| [[Category:Machine learning algorithms]]
| |