Bromley equation: Difference between revisions

Revision as of 03:08, 30 November 2013

Simultaneous perturbation stochastic approximation (SPSA) is an algorithmic method for optimizing systems with multiple unknown parameters. It is a type of stochastic approximation algorithm. As an optimization method, it is well suited to large-scale population models, adaptive modeling, simulation optimization, and atmospheric modeling. Many examples are presented at the SPSA website http://www.jhuapl.edu/SPSA. A comprehensive recent book on the subject is Bhatnagar et al. (2013). An early paper on the subject is Spall (1987), and the foundational paper providing the key theory and justification is Spall (1992).

SPSA is a descent method capable of finding global minima. Its main feature is a gradient approximation that requires only two measurements of the objective function, regardless of the dimension of the optimization problem. Recall that we want to find the optimal control $u^{*}$ that minimizes the loss function $J (u)$:

u^{*} = \arg \min_{u \in U} J (u) .

Both Finite Differences Stochastic Approximation (FDSA) and SPSA use the same iterative process:

u_{n + 1} = u_{n} - a_{n} {\hat{g}}_{n} (u_{n}),

where $u_{n} = ((u_{n})_{1}, (u_{n})_{2}, \dots, (u_{n})_{p})^{T}$ is the $n$th iterate, ${\hat{g}}_{n} (u_{n})$ is an estimate of the gradient $g (u) = \frac{\partial}{\partial u} J (u)$ of the objective function, evaluated at $u_{n}$, and $\{a_{n}\}$ is a sequence of positive numbers converging to 0. If $u_{n}$ is a $p$-dimensional vector, the $i$th component of the symmetric finite difference gradient estimator is:

FD:

({\hat{g}}_{n} (u_{n}))_{i} = \frac{J (u_{n} + c_{n} e_{i}) - J (u_{n} - c_{n} e_{i})}{2 c_{n}},

for $1 \le i \le p$, where $e_{i}$ is the unit vector with a 1 in the $i$th place, and $c_{n}$ is a small positive number that decreases with $n$. With this method, $2p$ evaluations of $J$ are needed for each ${\hat{g}}_{n}$. Clearly, when $p$ is large, this estimator becomes expensive.
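
The FD estimator above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `fd_gradient` and `make_counting_loss` and the quadratic test loss are hypothetical and do not come from the cited references. The counter makes the $2p$ cost visible.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def fd_gradient(loss: Callable[[Vector], float], u: Vector, c: float) -> Vector:
    """Symmetric finite-difference gradient estimate (2 * len(u) loss evaluations)."""
    grad = []
    for i in range(len(u)):
        u_plus = list(u)
        u_minus = list(u)
        u_plus[i] += c   # u + c * e_i
        u_minus[i] -= c  # u - c * e_i
        grad.append((loss(u_plus) - loss(u_minus)) / (2.0 * c))
    return grad


def make_counting_loss() -> Tuple[Callable[[Vector], float], Dict[str, int]]:
    """Hypothetical test loss J(u) = sum(u_i^2) that counts how often it is called."""
    calls = {"n": 0}

    def loss(u: Vector) -> float:
        calls["n"] += 1
        return sum(x * x for x in u)

    return loss, calls


loss, calls = make_counting_loss()
estimate = fd_gradient(loss, [1.0, -2.0, 0.5], c=0.01)
print(estimate)    # close to the true gradient [2.0, -4.0, 1.0]
print(calls["n"])  # number of loss evaluations used for p = 3
```

For this quadratic loss the symmetric difference reproduces the gradient up to rounding; for a general loss the error depends on $c_{n}$.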

Now let $Δ_{n}$ be a random perturbation vector. The $i$th component of the simultaneous perturbation gradient estimator is:

SP:

({\hat{g}}_{n} (u_{n}))_{i} = \frac{J (u_{n} + c_{n} Δ_{n}) - J (u_{n} - c_{n} Δ_{n})}{2 c_{n} (Δ_{n})_{i}} .

Note that FD perturbs only one direction at a time, while the SP estimator perturbs all directions at the same time (the numerator is identical in all $p$ components). The number of loss function measurements needed in the SPSA method for each ${\hat{g}}_{n}$ is always 2, independent of the dimension $p$. Thus, SPSA uses $p$ times fewer function evaluations per iteration than FDSA, which makes each iteration much cheaper when $p$ is large.
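
A minimal sketch of the SP estimator follows, assuming perturbation components drawn as $\pm 1$ with equal probability (one possible choice). The names `sp_gradient` and `make_counting_loss` and the test loss are hypothetical. The point of the example is that the loss is evaluated twice even for $p = 50$.

```python
import random
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def sp_gradient(loss: Callable[[Vector], float], u: Vector, c: float,
                rng: random.Random) -> Vector:
    """Simultaneous-perturbation gradient estimate (always 2 loss evaluations)."""
    # Assumption: each component of the perturbation is +1 or -1 with probability 0.5.
    delta = [rng.choice((-1.0, 1.0)) for _ in u]
    u_plus = [x + c * d for x, d in zip(u, delta)]
    u_minus = [x - c * d for x, d in zip(u, delta)]
    diff = loss(u_plus) - loss(u_minus)  # numerator shared by all p components
    return [diff / (2.0 * c * d) for d in delta]


def make_counting_loss() -> Tuple[Callable[[Vector], float], Dict[str, int]]:
    """Hypothetical test loss J(u) = sum(u_i^2) that counts how often it is called."""
    calls = {"n": 0}

    def loss(u: Vector) -> float:
        calls["n"] += 1
        return sum(x * x for x in u)

    return loss, calls


loss, calls = make_counting_loss()
u = [0.1 * i for i in range(50)]  # p = 50
estimate = sp_gradient(loss, u, c=0.01, rng=random.Random(0))
print(len(estimate), calls["n"])  # 50 components from 2 loss evaluations
```

A single SP estimate is usually a poor approximation of the gradient; the method relies on the estimates being close to the gradient on average over the perturbations.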

Simple experiments with $p = 2$ showed that SPSA converges in the same number of iterations as FDSA. The latter approximately follows the steepest descent direction, behaving like the gradient method. SPSA, on the other hand, with its random search direction, does not follow the gradient path exactly. On average, though, it tracks the gradient path closely, because the gradient approximation is an almost unbiased estimator of the gradient, as shown in the following lemma.

Convergence lemma

Denote by

b_{n} = E [{\hat{g}}_{n} | u_{n}] - \nabla J (u_{n})

the bias in the estimator ${\hat{g}}_{n}$. Assume that the $\{(Δ_{n})_{i}\}$ are mutually independent with zero mean and bounded second moments, and that $E (| (Δ_{n})_{i} |^{- 1})$ is uniformly bounded. Then $b_{n} \to 0$ with probability 1.
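
The cancellation behind the lemma can be checked numerically in a simple case. The sketch below assumes a hypothetical quadratic loss and $\pm 1$ perturbations, and averages the SP estimate over all $2^{p}$ equally likely sign patterns instead of sampling. For a quadratic loss the bias is exactly zero, so the example illustrates how the cross terms average out rather than the limit $c_{n} \to 0$.

```python
from itertools import product
from typing import List, Sequence

Vector = List[float]


def loss(u: Sequence[float]) -> float:
    # Hypothetical quadratic loss with a cross term between u[0] and u[1].
    return u[0] ** 2 + 3.0 * u[0] * u[1] + 2.0 * u[1] ** 2 + u[2] ** 2


def true_gradient(u: Sequence[float]) -> Vector:
    # Closed-form gradient of the loss above.
    return [2.0 * u[0] + 3.0 * u[1], 3.0 * u[0] + 4.0 * u[1], 2.0 * u[2]]


def sp_estimate(u: Sequence[float], c: float, delta: Sequence[float]) -> Vector:
    # SP gradient estimate for one fixed perturbation vector delta.
    u_plus = [x + c * d for x, d in zip(u, delta)]
    u_minus = [x - c * d for x, d in zip(u, delta)]
    diff = loss(u_plus) - loss(u_minus)
    return [diff / (2.0 * c * d) for d in delta]


u = [1.0, -0.5, 2.0]
c = 0.1
patterns = list(product((-1.0, 1.0), repeat=len(u)))  # all 8 sign patterns
estimates = [sp_estimate(u, c, d) for d in patterns]
mean_estimate = [sum(e[i] for e in estimates) / len(estimates) for i in range(len(u))]

print(estimates[0])      # a single estimate can be far from the gradient
print(mean_estimate)     # the average over all patterns
print(true_gradient(u))  # the exact gradient, for comparison
```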

Sketch of the proof

The main idea is to condition on $Δ_{n}$ to express $E [({\hat{g}}_{n})_{i}]$ and then to use a second-order Taylor expansion of $J (u_{n} + c_{n} Δ_{n})$ and $J (u_{n} - c_{n} Δ_{n})$. After algebraic manipulations using the zero mean and the independence of the $\{(Δ_{n})_{i}\}$, we get

E [({\hat{g}}_{n})_{i} | u_{n}] = (g (u_{n}))_{i} + O (c_{n}^{2}) .

The result follows from the hypothesis that $c_{n} \to 0$.
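
The algebraic manipulations can be spelled out as follows. This is a sketch only: the symbol $H$ for the Hessian of $J$ is notation introduced here, and the remainder terms are assumed to be controlled by the smoothness and moment conditions.

```latex
% Taylor expansion around u_n, with g = \nabla J and H the Hessian of J (assumed notation)
J(u_n \pm c_n \Delta_n) = J(u_n) \pm c_n \, g(u_n)^{T} \Delta_n
    + \tfrac{1}{2} c_n^{2} \, \Delta_n^{T} H(u_n) \Delta_n + O(c_n^{3})

% The even-order terms cancel in the difference
J(u_n + c_n \Delta_n) - J(u_n - c_n \Delta_n) = 2 c_n \, g(u_n)^{T} \Delta_n + O(c_n^{3})

% Divide by 2 c_n (\Delta_n)_i
(\hat{g}_n)_i = (g(u_n))_i
    + \sum_{j \neq i} (g(u_n))_j \frac{(\Delta_n)_j}{(\Delta_n)_i} + O(c_n^{2})

% For j \neq i, independence and zero mean give
% E[(\Delta_n)_j / (\Delta_n)_i] = E[(\Delta_n)_j] \, E[1 / (\Delta_n)_i] = 0
E[(\hat{g}_n)_i \mid u_n] = (g(u_n))_i + O(c_n^{2})
```

The bounded inverse moment $E (| (Δ_{n})_{i} |^{- 1})$ is what keeps the ratio terms and the remainder finite after division by $(Δ_{n})_{i}$.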

Next we summarize some of the hypotheses under which $u_{t}$ converges in probability to the set of global minima of $J (u)$. The efficiency of the method depends on the shape of $J (u)$, the values of the parameters $a_{t}$ and $c_{t}$, and the distribution of the perturbation terms $Δ_{t i}$. First, the algorithm parameters must satisfy the following conditions:

$a_{t} > 0$, $a_{t} \to 0$ as $t \to \infty$, and $\sum_{t = 1}^{\infty} a_{t} = \infty$. A good choice would be $a_{t} = \frac{a}{t}$ with $a > 0$;
$c_{t} = \frac{c}{t^{γ}}$, where $c > 0$ and $γ \in [\frac{1}{6}, \frac{1}{2}]$;
$\sum_{t = 1}^{\infty} (\frac{a_{t}}{c_{t}})^{2} < \infty$;
$Δ_{t i}$ must be mutually independent zero-mean random variables, symmetrically distributed about zero, with $| Δ_{t i} | < a_{1} < \infty$. The inverse first and second moments of the $Δ_{t i}$ must be finite.
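
As a worked check of the gain conditions, assume the particular choices $a_{t} = a / t$ and $c_{t} = c / t^{γ}$ named in the list. The first and third conditions then reduce to statements about $p$-series:

```latex
% Assumed gains: a_t = a / t, c_t = c / t^{\gamma}, with a > 0 and c > 0
\sum_{t=1}^{\infty} a_t = a \sum_{t=1}^{\infty} \frac{1}{t} = \infty

\sum_{t=1}^{\infty} \left( \frac{a_t}{c_t} \right)^{2}
    = \frac{a^{2}}{c^{2}} \sum_{t=1}^{\infty} \frac{1}{t^{2 (1 - \gamma)}} < \infty
    \iff 2 (1 - \gamma) > 1 \iff \gamma < \tfrac{1}{2}
```

So, for this specific $a_{t}$, the third condition holds only for $γ$ strictly below $\frac{1}{2}$; other choices of $a_{t}$ lead to a different constraint.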

A good choice for $Δ_{t i}$ is the symmetric Bernoulli distribution, taking the values $\pm 1$ with probability 0.5 each (other choices are possible too). The uniform and normal distributions do not satisfy the finite inverse moment conditions, so they cannot be used.
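
Putting the pieces together, a minimal SPSA loop might look like the sketch below. The function name `spsa_minimize`, the default constants, and the quadratic test loss are all illustrative assumptions, not recommended settings. It uses $a_{t} = a / t$, $c_{t} = c / t^{γ}$ with $γ = 0.25$, and $\pm 1$ Bernoulli perturbations.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def spsa_minimize(loss: Callable[[Vector], float], u0: Sequence[float],
                  a: float = 0.2, c: float = 0.1, gamma: float = 0.25,
                  iterations: int = 500, seed: int = 0) -> Vector:
    """Basic SPSA iteration u_{t+1} = u_t - a_t * g_hat_t (illustrative sketch)."""
    rng = random.Random(seed)
    u = list(u0)
    for t in range(1, iterations + 1):
        a_t = a / t           # step-size gain
        c_t = c / t ** gamma  # perturbation size
        # Symmetric Bernoulli perturbation: each component is +1 or -1.
        delta = [rng.choice((-1.0, 1.0)) for _ in u]
        u_plus = [x + c_t * d for x, d in zip(u, delta)]
        u_minus = [x - c_t * d for x, d in zip(u, delta)]
        diff = loss(u_plus) - loss(u_minus)  # two loss evaluations per iteration
        g_hat = [diff / (2.0 * c_t * d) for d in delta]
        u = [x - a_t * g for x, g in zip(u, g_hat)]
    return u


def loss(u: Sequence[float]) -> float:
    # Hypothetical test loss with its minimum at (1, -2).
    return (u[0] - 1.0) ** 2 + (u[1] + 2.0) ** 2


u0 = [4.0, -3.0]
u_final = spsa_minimize(loss, u0)
print(loss(u0), loss(u_final))  # loss before and after the SPSA run
```

On this particular quadratic the small value of `a` keeps every update from increasing the loss; on other problems the constants usually need tuning, and the loss may increase on individual iterations.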

The loss function $J (u)$ must be thrice continuously differentiable and the individual elements of the third derivative must be bounded: $| J^{(3)} (u) | < a_{3} < \infty$. Also, $| J (u) | \to \infty$ as $\| u \| \to \infty$.

In addition, $\nabla J$ must be Lipschitz continuous and bounded, and the ODE $\dot{u} = g (u)$ must have a unique solution for each initial condition. Under these conditions and a few others, $u_{t}$ converges in probability to the set of global minima of $J (u)$ (see Maryak and Chin, 2008).

References

Bhatnagar, S., Prasad, H. L., and Prashanth, L. A. (2013), Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods, Springer.
Hirokami, T., Maeda, Y., Tsukada, H. (2006), "Parameter estimation using simultaneous perturbation stochastic approximation", Electrical Engineering in Japan, 154 (2), 30–3. doi:10.1002/eej.20239
Maryak, J.L., and Chin, D.C. (2008), "Global Random Optimization by Simultaneous Perturbation Stochastic Approximation," IEEE Transactions on Automatic Control, vol. 53, pp. 780-783.
Spall, J. C. (1987), “A Stochastic Approximation Technique for Generating Maximum Likelihood Parameter Estimates,” Proceedings of the American Control Conference, Minneapolis, MN, June 1987, pp. 1161–1167.
Spall, J. C. (1992), “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation,” IEEE Transactions on Automatic Control, vol. 37(3), pp. 332–341.
Spall, J.C. (1998), "Overview of the Simultaneous Perturbation Method for Efficient Optimization," Johns Hopkins APL Technical Digest, 19(4), 482–492. http://www.jhuapl.edu/SPSA/PDF-SPSA/Spall_An_Overview.PDF
Spall, J.C. (2003) Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley. ISBN 0-471-33052-3 (Chapter 7)

@@ Line 1: / Line 1: @@
+'''Simultaneous perturbation stochastic approximation''' (SPSA) is an [[algorithmic]] method for optimizing systems with multiple unknown [[parameters]]. It is a type of [[stochastic approximation]] algorithm. As an optimization method, it is appropriately suited to large-scale population models, adaptive modeling, simulation [[optimization]], and [[atmospheric model]]ing. Many examples are presented at the SPSA website http://www.jhuapl.edu/SPSA. A comprehensive recent book on the subject is Bhatnagar et al. (2013). An early paper on the subject is Spall (1987) and the foundational paper providing the key theory and justification is Spall (1992).
+SPSA is a descent method capable of finding global minima. Its main feature is the gradient approximation that requires only two measurements of the objective function, regardless of the dimension of the optimization problem. Recall that we want to find the optimal control <math>u^*</math> with loss
+function <math>J(u)</math>:
-In fact, the [http://answers.yahoo.com/search/search_result?p=discerning+consumer&submit-go=Search+Y!+Answers discerning consumer] may want to consider its own share of advantages and disadvantages If you have any inquiries with regards to the place and how to use [http://secondwatersoftenerslizard.officialgottagotravel.net/ different types of water softener systems], you can make contact with us at the website. .
+:<math>u^* = \arg  \min_{u \in U} J(u).</math>
+Both [[Finite Differences Stochastic Approximation]] (FDSA)
+and SPSA use the same iterative process:
+:<math>u_{n+1} = u_n - a_n\hat{g}_n(u_n),</math>
+where <math>u_n=((u_n)_1,(u_n)_2,\ldots,(u_n)_p)^T</math>
+represents the <math>n^{th}</math> iterate, <math>\hat{g}_n(u_n)</math> is the estimate of the gradient of the objective function <math>g(u)= \frac{\partial}{\partial u}J(u)</math> evaluated at <math>{u_n}</math>, and <math>\{a_n\}</math> is a positive number sequence converging to 0. If <math>u_n</math> is a ''p''-dimensional vector, the <math>i^{th}</math> component of the [[symmetric]] finite difference gradient estimator is:
+:'''FD:''' <math>(\hat{g_n}(u_n))_i = \frac{J(u_n+c_ne_i)-J(u_n-c_ne_i)}{2c_n},</math>
+''1 ≤i ≤p'', where <math>e_i</math> is the unit vector with a 1 in the <math>i^{th}</math>
+place, and <math>c_n</math>is a small positive number that decreases with ''n''. With this method, ''2p'' evaluations of ''J'' for each <math>g_n</math> are needed. Clearly, when ''p'' is large, this estimator loses efficiency.
+Let now  <math>\Delta_n</math> be a random perturbation vector. The <math>i^{th}</math> component of the stochastic perturbation gradient estimator is:
+:'''SP:''' <math>(\hat{g_n}(u_n))_i = \frac{J(u_n+c_n\Delta_n)-J(u_n-c_n\Delta_n)}{2c_n(\Delta_n)_i}.</math>
+Remark that FD perturbs only one direction at a time, while the SP estimator disturbs all directions at the same time (the numerator is identical in all ''p'' components). The number of loss function measurements needed in the SPSA method for each <math>g_n</math> is always 2, independent of the [[dimension]] ''p''. Thus, SPSA uses ''p'' times fewer function evaluations than FDSA, which makes it a lot more efficient.
+Simple experiments with ''p=2'' showed that SPSA converges in the same number of iterations as FDSA. The latter follows [[Approximation|approximately]] the [[steepest]] descent direction, behaving like the gradient method. On the other hand, SPSA, with the random search direction, does not follow exactly the gradient path. In average though, it tracks it nearly because the gradient approximation is an almost [[unbiased]]
+estimator of the gradient, as shown in the following lemma.
+== Convergence lemma ==
+Denote by
+:<math>b_n = E[\hat{g}_n|u_n] -\nabla J(u_n) </math>
+the bias in the estimator <math>\hat{g}_n</math>. Assume that <math>\{(\Delta_n)_i\}</math> are all mutually independent with zero-mean, bounded second
+moments, and <math>E(|(\Delta_n)_i|^{-1})</math> uniformly bounded. Then <math>b_n</math>→0 w.p.&nbsp;1.
+== Sketch of the proof ==
+The main [[idea]] is to use conditioning on <math>\Delta_n</math> to express <math>E[(\hat{g}_n)_i]</math> and then to use a second order Taylor expansion of <math>J(u_n+c_n\Delta_n)_i</math> and <math>J(u_n-c_n\Delta_n)_i</math>. After algebraic manipulations using the zero mean and the independence of <math>\{(\Delta_n)_i\}</math>, we get
+:<math>E[(\hat{g}_n)_i]=(g_n)_i + O(c_n^2)</math>
+The result follows from the [[hypothesis]] that <math>c_n</math>→0.
+Next we resume some of the hypotheses under which <math>u_t</math> converges in [[probability]] to the set of global minima of <math>J(u)</math>. The efficiency of
+the method depends on the shape of <math>J(u)</math>, the values of the parameters <math>a_k</math> and <math>c_k</math> and the distribution of the perturbation terms <math>\Delta_{ki}</math>. First, the algorithm parameters must satisfy the
+following conditions:
+*  <math>a_t</math> >0, <math>a_t</math>→0 when t→∝ and <math>\sum_{t=1}^{\infty} a_t = \infty </math>. A good choice would be <math>a_t=\frac{a}{t};</math> a>0;
+*  <math>c_t=\frac{c}{t^\gamma}</math>, where c>0, <math> \gamma \in \left[\frac{1}{6},\frac{1}{2}\right]</math>;
+* <math>\sum_{t=1}^{\infty} (\frac {a_t}{c_t})^2 < \infty </math>
+* <math> \Delta_{ti} </math> must be mutually independent zero-mean random variables, symmetrically distributed about zero, with <math>\Delta_{ki} < a_1 < \infty </math>. The inverse first and second moments of the <math> \Delta_{ti} </math> must be finite.
+A good choice for <math>\Delta_{ki}</math> is Bernoulli +-1 with probability 0.5 (other choices are possible too). The uniform and normal distributions do not satisfy the finite inverse moment conditions, so can not be used.
+The loss function ''J(u)'' must be thrice continuously [[differentiable]] and the individual elements of the third derivative must be bounded: <math>|J^{(3)}(u)| < a_3 < \infty </math>. Also, ''|J(u)|→∝'' as ''u→∝''.
+In addition, <math>\nabla J</math> must be Lipschitz continuous, bounded and the ODE <math> \dot{u}=g(u)</math> must have a unique solution for each initial condition.
+Under these conditions and a few others, <math>u_k</math> [[Convergence (mathematics)|converges]] in probability to the set of global minima of J(u) (see Maryak and Chin, 2008).
+==References==
+* Bhatnagar, S., Prasad, H. L., and Prashanth, L. A. (2013), ''Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods'', Springer.
+* Hirokami, T., Maeda, Y., Tsukada, H. (2006) "Parameter estimation using simultaneous perturbation stochastic approximation", Electrical Engineering in Japan, 154 (2), 30–3 [http://dx.doi.org/10.1002/eej.20239]
+* Maryak, J.L., and Chin, D.C. (2008), "Global Random Optimization by Simultaneous Perturbation Stochastic Approximation," ''IEEE Transactions on Automatic Control'', vol. 53, pp. 780-783.
+* Spall, J. C. (1987), “A Stochastic Approximation Technique for Generating Maximum Likelihood Parameter Estimates,” ''Proceedings of the American Control Conference'', Minneapolis, MN, June 1987, pp. 1161–1167.
+* Spall, J. C. (1992), “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation,” ''IEEE Transactions on Automatic Control'', vol. 37(3), pp. 332–341.
+* Spall, J.C. (1998). "Overview of the Simultaneous Perturbation Method for Efficient Optimization" [http://www.jhuapl.edu/SPSA/PDF-SPSA/Spall_An_Overview.PDF 2]. ''Johns Hopkins APL Technical Digest'', 19(4), 482–492.
+* Spall, J.C. (2003) ''Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control'', Wiley. ISBN 0-471-33052-3 (Chapter 7)
+<references/>
+[[Category:Numerical climate and weather models]]
+[[Category:Stochastic algorithms]]
+[[Category:Optimization algorithms and methods]]
