| {{technical|date=June 2012}}
| '''Stochastic approximation''' methods are a family of iterative [[stochastic optimization]] [[algorithm]]s that attempt to find zeroes or extrema of functions which cannot be computed directly, but only estimated via noisy observations.
| |
| | |
| Mathematically, this refers to solving:
| |
| | |
| :<math> \min_{x \in \Theta}\; f(x) = \mathbb E[F(x,\xi)] </math>
| |
| where the objective is to find the parameter <math>x \in \Theta</math> that minimizes <math>f(x)</math>, with <math>\xi</math> a random variable whose distribution need not be known. Denoting by <math>d</math> the dimension of the parameter <math>x</math>, we assume that the domain <math>\Theta \subset \mathbb R^d</math> is known, but that the objective function <math>f(x)</math> cannot be computed exactly and can only be approximated via simulation. Intuitively, <math>f(x)</math> is the function we want to minimize, but noise prevents it from being evaluated exactly. This situation is modeled by the function <math>F(x,\xi)</math>, where the random variable <math>\xi</math> represents the noise; since <math>\xi</math> is random, so is the value of <math>F(x,\xi)</math>. The objective is then to minimize <math>f(x)</math> using only evaluations of <math>F(x,\xi)</math>, and a natural way to do this is to minimize the expectation <math>\mathbb E[F(x,\xi)]</math>.
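The relationship between <math>f(x)</math> and <math>F(x,\xi)</math> can be illustrated with a minimal Python sketch. The quadratic objective, the Gaussian noise model and the function names are hypothetical choices made only for illustration; they are not part of the problem statement above.

```python
import random

def F(x, xi):
    # Hypothetical noisy objective: the "true" f(x) = (x - 2)^2 plus zero-mean noise xi.
    return (x - 2.0) ** 2 + xi

def estimate_f(x, n_samples, rng):
    # Monte Carlo estimate of f(x) = E[F(x, xi)] from repeated noisy evaluations.
    total = 0.0
    for _ in range(n_samples):
        xi = rng.gauss(0.0, 1.0)  # assumed noise distribution (standard normal)
        total += F(x, xi)
    return total / n_samples

rng = random.Random(0)
single = F(3.0, rng.gauss(0.0, 1.0))     # one noisy observation of f(3)
averaged = estimate_f(3.0, 10_000, rng)  # average of many observations
```

A single evaluation can be far from <math>f(3) = 1</math>, whereas the average of many evaluations concentrates around it. Stochastic approximation avoids this brute-force averaging at every point by spreading the averaging over the iterations.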
| |
| | |
| The first, and prototypical, algorithms of this kind are the '''Robbins-Monro''' and '''Kiefer-Wolfowitz''' algorithms.
| |
| | |
| ==Robbins–Monro algorithm==
| |
| The Robbins–Monro algorithm, introduced in 1951 by [[Herbert Robbins]] and Sutton Monro,<ref name="rm">{{cite doi|10.1214/aoms/1177729586}}</ref> is a method for solving a root-finding problem in which the function is represented as an expected value. Assume that we have a function <math>M(x)</math>, and a constant <math>\alpha</math>, such that the equation <math>M(x) = \alpha</math> has a unique root at <math>x=\theta</math>. It is assumed that while we cannot directly observe the function <math>M(x)</math>, we can instead obtain measurements of the random variable <math>N(x)</math> where <math>\mathbb E[N(x)] = M(x)</math>. The algorithm then generates iterates of the form:
| |
| | |
| ::<math>x_{n+1}-x_n=a_n(\alpha-N(x_n))</math>
| |
| | |
| Here, <math>a_1, a_2, \dots</math> is a sequence of positive step sizes. [[Herbert Robbins|Robbins]] and Monro proved <ref name="rm" /><sup>, Theorem 2</sup> that <math>x_n</math> [[convergence of random variables|converges]] in <math>L^2</math> (and hence also in probability) to <math>\theta</math> provided that:
| |
| * <math>N(x)</math> is uniformly bounded,
| |
| * <math>M(x)</math> is nondecreasing,
| |
| * <math>M'(\theta)</math> exists and is positive, and
| |
| * The sequence <math>a_n</math> satisfies the following requirements:
| |
| :: <math>\qquad \sum^{\infty}_{n=1}a_n = \infty \quad \mbox{ and } \quad \sum^{\infty}_{n=1}a^2_n < \infty \quad </math>
| |
| | |
| One sequence of step sizes that satisfies these conditions, suggested by Robbins and Monro, has the form <math>a_n=a/n</math>, for <math> a > 0 </math>. Other sequences are possible, but the above conditions must be met in order to average out the noise in <math>N(x)</math>.
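The iteration can be sketched in a few lines of Python. The choice <math>M(x) = \tanh(x)</math>, the uniform noise, the target level <math>\alpha = 0.5</math> and the constant <math>a = 2</math> are assumptions made for this example only; the function names are likewise hypothetical.

```python
import math
import random

def robbins_monro(noisy_measurement, alpha, x0, a, n_iter):
    # Iterate x_{n+1} = x_n + a_n * (alpha - N(x_n)) with step sizes a_n = a / n.
    x = x0
    for n in range(1, n_iter + 1):
        a_n = a / n
        x = x + a_n * (alpha - noisy_measurement(x))
    return x

rng = random.Random(0)

def N(x):
    # Assumed example: M(x) = tanh(x) is bounded and nondecreasing;
    # the additive noise is uniform on [-0.5, 0.5], so N(x) is uniformly bounded.
    return math.tanh(x) + rng.uniform(-0.5, 0.5)

theta_hat = robbins_monro(N, alpha=0.5, x0=0.0, a=2.0, n_iter=20_000)
theta = math.atanh(0.5)  # exact root of tanh(x) = 0.5, used only for comparison
```

In this setup `theta_hat` should end up close to `theta`, although the exact value depends on the random seed and on the constant `a`.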
| |
| | |
| ===Complexity results===
| |
| #If <math>f(x)</math> is twice continuously differentiable and strongly convex, and the minimizer of <math>f(x)</math> belongs to the interior of <math>\Theta</math>, then the Robbins-Monro algorithm achieves the asymptotically optimal convergence rate with respect to the objective function, namely <math>\mathbb E[f(x_n) - f^*] = O(1/n)</math>, where <math>f^*</math> is the minimal value of <math>f(x)</math> over <math>x \in \Theta</math>.<ref name="jsacks">{{cite doi|10.1214/aoms/1177706619}}</ref><ref name="NJLS">{{cite doi|10.1137/070704277}}</ref>
| |
| # By contrast, in the general convex case, where neither smoothness nor strong convexity is assumed, Nemirovski and Yudin<ref name="NYcomp">Problem Complexity and Method Efficiency in Optimization, A. Nemirovski and D. Yudin, ''Wiley -Intersci. Ser. Discrete Math'' '''15''' ''John Wiley'' ''New York'' (1983).</ref> have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is <math>O(1/\sqrt{n})</math>. They also proved that this rate cannot be improved.
| |
| | |
| ===Subsequent developments===
| |
| While the Robbins-Monro algorithm is theoretically able to achieve <math> O(1/n)</math> under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly in practice. This is primarily because the algorithm is very sensitive to the choice of the step size sequence, and the supposedly asymptotically optimal step size policy can be quite harmful in the early iterations.<ref name="NJLS" /><ref name="jcsbook">Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control, J.C. Spall, ''John Wiley'' ''Hoboken, NJ'', (2003).</ref>
| |
| | |
| To overcome this shortfall, Polyak and Juditsky<ref name="pj">{{cite doi|10.1137/0330046}}</ref> presented a method of accelerating Robbins-Monro through the use of longer steps and averaging of the iterates. The algorithm has the following structure:
| |
| ::<math> x_{n+1} - x_n = b_n(\alpha - N(x_n)), \qquad \bar{x}_n = \frac{1}{n} \sum^{n-1}_{i=0} x_i </math>
| |
| The convergence of <math> \bar{x}_n </math> to the unique root <math>\theta</math> relies on the condition that the step sequence <math>\{b_n\}</math> decreases sufficiently slowly, that is:
| |
| ::<math> b_n \rightarrow 0, \qquad \frac{b_n - b_{n+1}}{b_n} = o(b_n)</math>
| |
| Therefore, the sequence <math>b_n = n^{-\alpha}</math> with <math>0 < \alpha < 1</math> satisfies this restriction, but <math>\alpha = 1</math> does not, hence the longer steps (here <math>\alpha</math> denotes the decay exponent of the step sizes, not the target level in <math>M(x) = \alpha</math>). Under the assumptions outlined for the Robbins-Monro algorithm, the modification attains the same asymptotically optimal convergence rate <math>O(1/n)</math>, yet with a more robust step size policy.<ref name ="pj" />
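A minimal Python sketch of the averaged scheme follows. To avoid the clash of symbols, the decay exponent is called `gamma` in the code. The test function <math>M(x) = \tanh(x)</math>, the uniform noise, and the constants `b` and `gamma` are assumptions chosen for illustration, not values taken from the cited work.

```python
import math
import random

def averaged_robbins_monro(noisy_measurement, alpha, x0, b, gamma, n_iter):
    # Robbins-Monro with slowly decaying steps b_n = b * (n + 1)^(-gamma), 0 < gamma < 1,
    # together with the average x_bar of the iterates x_0, ..., x_{n_iter - 1}.
    x = x0
    running_sum = 0.0
    for n in range(n_iter):
        running_sum += x  # accumulate x_n before it is updated
        b_n = b * (n + 1) ** (-gamma)
        x = x + b_n * (alpha - noisy_measurement(x))
    x_bar = running_sum / n_iter
    return x, x_bar

rng = random.Random(1)

def N(x):
    # Assumed example: M(x) = tanh(x) observed with uniform noise on [-0.5, 0.5].
    return math.tanh(x) + rng.uniform(-0.5, 0.5)

x_last, x_bar = averaged_robbins_monro(
    N, alpha=0.5, x0=0.0, b=1.0, gamma=2.0 / 3.0, n_iter=20_000
)
theta = math.atanh(0.5)  # exact root, used only for comparison
```

The last iterate `x_last` fluctuates more because the steps are longer, while the average `x_bar` smooths these fluctuations out.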
| |
| | |
| Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin <ref name="NY">On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions, A. Nemirovski and D. Yudin, ''Dokl. Akad. Nauk SSR'' '''2939''', (1978 (Russian)), Soviet Math. Dokl. '''19''' (1978 (English)).</ref> for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems. These algorithms were observed to attain the nonasymptotic rate <math>O(1/\sqrt{n})</math>.
| |
| | |
| ==Kiefer-Wolfowitz algorithm==
| |
| The Kiefer-Wolfowitz algorithm<ref name = "KW">{{cite doi|10.1214/aoms/1177729392}}</ref> was introduced in 1952 and was motivated by the publication of the Robbins-Monro algorithm. However, it was presented as a method for stochastically estimating the maximum of a function. Let <math>M(x) </math> be a function which has a maximum at the point <math>\theta </math>. It is assumed that <math>M(x)</math> is unknown; however, certain observations <math>N(x)</math>, where <math>\mathbb E[N(x)] = M(x)</math>, can be made at any point <math>x</math>. The algorithm follows a gradient-like method, with the iterates generated as follows:
| |
| ::<math> x_{n+1} = x_n + a_n \bigg(\frac{N(x_n + c_n) - N(x_n -c_n)}{c_n} \bigg) </math>
| |
| where the gradient of <math>M(x)</math> is approximated using finite differences. The sequence <math>\{c_n\}</math> specifies the finite difference widths used for the gradient approximation, while the sequence <math>\{a_n\}</math> specifies the positive step sizes taken along that direction. Kiefer and Wolfowitz proved that, if <math>M(x)</math> satisfies certain regularity conditions, then <math>x_n</math> converges to <math>\theta</math> provided that:
| |
| * The function <math>M(x)</math> has a unique point of maximum (minimum) and is strongly concave (convex).
| |
| ** The algorithm was first presented with the requirement that the function <math>M(\cdot)</math> be strongly convex (concave) over the entire feasible space. Because this condition is too restrictive to impose over the entire domain, Kiefer and Wolfowitz proposed that it is sufficient to impose it over a compact set <math>C_0 \subset \mathbb R^d</math> which is known to include the optimal solution.
| |
| *The selected sequences <math>\{a_n\}</math> and <math>\{c_n\}</math> must be infinite sequences of positive numbers such that:
| |
| ::<math>\mbox{1. } \quad c_n \rightarrow 0, \quad a_n \rightarrow 0 \quad \mbox{ as } \quad n \rightarrow \infty </math>
| |
| | |
| ::<math> \mbox{2. } \quad \sum^\infty_{n=1} a_n = \infty, \qquad \sum^\infty_{n=1} \frac{a^2_n}{c^2_n} < \infty </math>
| |
| | |
| A suitable choice of sequences, as recommended by Kiefer and Wolfowitz, would be <math>a_n = 1/n</math> and <math>c_n = n^{-1/3}</math>.
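The iteration with these sequences can be sketched in Python as follows. The concave test function <math>M(x) = -(x-1)^2</math>, the Gaussian noise and the function names are hypothetical and serve only as an illustration.

```python
import random

def kiefer_wolfowitz(noisy_measurement, x0, n_iter):
    # Iterate x_{n+1} = x_n + a_n * (N(x_n + c_n) - N(x_n - c_n)) / c_n
    # with a_n = 1/n and c_n = n^(-1/3).
    x = x0
    for n in range(1, n_iter + 1):
        a_n = 1.0 / n
        c_n = n ** (-1.0 / 3.0)
        diff = noisy_measurement(x + c_n) - noisy_measurement(x - c_n)
        x = x + a_n * diff / c_n
    return x

rng = random.Random(0)

def N(x):
    # Assumed example: M(x) = -(x - 1)^2 has its maximum at theta = 1;
    # each observation carries independent Gaussian noise.
    return -(x - 1.0) ** 2 + rng.gauss(0.0, 0.1)

x_hat = kiefer_wolfowitz(N, x0=0.0, n_iter=20_000)
```

Each iteration uses two noisy observations. Because the noise in the difference is divided by <math>c_n</math>, the widths must not shrink too fast relative to the step sizes, which is what the second summability condition expresses.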
| |
| | |
| ===Subsequent developments and important issues===
| |
| #The Kiefer-Wolfowitz algorithm requires that, for each gradient computation, at least <math>d+1</math> different parameter values be simulated in every iteration, where <math>d </math> is the dimension of the search space. When <math>d</math> is large, the algorithm therefore requires substantial computational effort per iteration, leading to slow convergence.
| |
| ## To address this problem, Spall proposed the use of [[Simultaneous perturbation stochastic approximation|simultaneous perturbations]] to estimate the gradient. This method requires only two simulations per iteration, regardless of the dimension <math>d</math>.<ref name = "Jsp">{{cite doi|10.1109/TAC.2000.880982}}</ref>
| |
| #In the conditions required for convergence, a predetermined compact set that fulfills strong convexity (or concavity) and contains the unique solution can be difficult to specify. In real-world applications, if the domain is quite large, these assumptions can be fairly restrictive and highly unrealistic.
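The simultaneous perturbation idea mentioned in the first item can be sketched in Python as below, written here for minimization. The quadratic loss, the noise level, the gain constants `a` and `c`, and the decay exponents 0.602 and 0.101 are illustrative assumptions rather than prescribed values, and the function names are hypothetical.

```python
import random

def spsa_minimize(noisy_loss, x0, n_iter, a=0.1, c=0.5, rng=None):
    # All coordinates are perturbed at once by a random +/-1 vector, so each
    # iteration needs only two noisy loss evaluations, whatever the dimension.
    rng = rng or random.Random()
    x = list(x0)
    for n in range(1, n_iter + 1):
        a_n = a / n ** 0.602  # assumed step size decay
        c_n = c / n ** 0.101  # assumed perturbation width decay
        delta = [rng.choice((-1.0, 1.0)) for _ in x]
        x_plus = [xi + c_n * di for xi, di in zip(x, delta)]
        x_minus = [xi - c_n * di for xi, di in zip(x, delta)]
        diff = noisy_loss(x_plus) - noisy_loss(x_minus)
        # Gradient estimate for coordinate i: diff / (2 * c_n * delta_i).
        x = [xi - a_n * diff / (2.0 * c_n * di) for xi, di in zip(x, delta)]
    return x

rng = random.Random(0)
target = [1.0, -2.0, 0.5, 3.0, 0.0]  # hypothetical minimizer in d = 5 dimensions

def noisy_loss(x):
    # Assumed example: squared distance to the target plus Gaussian noise.
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target)) + rng.gauss(0.0, 0.1)

x_hat = spsa_minimize(noisy_loss, [0.0] * 5, n_iter=5_000, rng=rng)
```

A coordinate-wise finite difference scheme would need at least <math>d+1 = 6</math> evaluations per iteration in this example, compared with two here.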
| |
| | |
| ==Further developments==
| |
| An extensive theoretical literature has grown up around these algorithms, concerning conditions for convergence, rates of convergence, multivariate and other generalizations, proper choice of step size, possible noise models, and so on.<ref name="kushneryin">{{cite doi|10.1007/978-1-4899-2696-8}}</ref><ref>''Stochastic Approximation and Recursive Estimation'', Mikhail Borisovich Nevel'son and Rafail Zalmanovich Has'minskiĭ, translated by Israel Program for Scientific Translations and B. Silver, Providence, RI: American Mathematical Society, 1973, 1976. ISBN 0-8218-1597-0.</ref> These methods are also applied in [[control theory]], in which case the unknown function which we wish to optimize or find the zero of may vary in time. In this case, the step size <math>a_n</math> should not converge to zero but should be chosen so as to track the function.<ref name="kushneryin"/><sup>, 2nd ed., chapter 3</sup>
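The effect of the step size on tracking can be illustrated with a small Python sketch. The linearly drifting root, the noise model, the constant step 0.05 and the function names are all assumptions made for this example.

```python
import random

def track_root(step_size, n_iter, drift=0.001, seed=0):
    # Robbins-Monro style iteration for M_n(x) = x - theta_n with target level 0,
    # where the root theta_n = drift * n moves slowly over time (assumed model).
    rng = random.Random(seed)
    x = 0.0
    for n in range(1, n_iter + 1):
        theta_n = drift * n
        measurement = (x - theta_n) + rng.uniform(-0.5, 0.5)
        x = x + step_size(n) * (0.0 - measurement)
    # Distance between the final iterate and the final position of the root.
    return abs(x - drift * n_iter)

err_decreasing = track_root(lambda n: 1.0 / n, n_iter=5_000)
err_constant = track_root(lambda n: 0.05, n_iter=5_000)
```

With steps <math>a_n = 1/n</math> the iterate is essentially the average of all past root positions and lags far behind the moving root, whereas the constant step keeps following it at the price of a persistent noise-induced fluctuation.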
| |
| | |
| [[C. Johan Masreliez]] and [[R. Douglas Martin]] were the first to apply stochastic approximation to [[Robust statistics|robust]] [[estimation]].<ref>{{cite doi|10.1109/TIT.1975.1055386}}</ref>
| |
| | |
| ==See also==
| |
| *[[Stochastic gradient descent]]
| |
| *[[Stochastic optimization]]
| |
| *[[Simultaneous perturbation stochastic approximation]]
| |
| | |
| ==References==
| |
| {{reflist}}
| |
| | |
| {{DEFAULTSORT:Stochastic Approximation}}
| |
| [[Category:Stochastic optimization]]
| |
| [[Category:Statistical approximations]]
| |