{{for|reinforcement learning in psychology|Reinforcement}}
{{Machine learning bar}}
 
'''Reinforcement learning''' is an area of [[machine learning]] inspired by [[Behaviorism|behaviorist psychology]], concerned with how [[software agent]]s ought to take ''actions'' in an ''environment'' so as to maximize some notion of cumulative ''reward''.  The problem, due to its generality, is studied  in many other disciplines, such as [[game theory]], [[control theory]], [[operations research]], [[information theory]], [[simulation-based optimization]], [[statistics]], and [[genetic algorithm]]s. In the operations research and control literature, the field where reinforcement learning methods are studied is called ''approximate dynamic programming''. The problem has been studied in the [[optimal control theory|theory of optimal control]], though most studies there are concerned with existence of optimal solutions and their characterization, and not with the learning or approximation aspects.
In [[economics]] and [[game theory]], reinforcement learning may be used to explain how equilibrium may arise under [[bounded rationality]].
 
In machine learning, the environment is typically formulated as a [[Markov decision process]] (MDP), and many reinforcement learning algorithms for this context are highly related to [[dynamic programming]] techniques. The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP and they target large MDPs where exact methods become infeasible.
 
Reinforcement learning differs from standard [[supervised learning]] in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the [[multi-armed bandit]] problem and in finite MDPs.
 
== Introduction ==
The basic reinforcement learning model consists of:
 
# a set of environment states <math>S</math>;
# a set of actions <math>A</math>;
# rules of transitioning between states;
# rules that determine the ''scalar immediate reward'' of a transition; and
# rules that describe what the agent observes.
 
The rules are often [[stochastic]]. The observation typically involves the scalar immediate reward associated with the last transition.
In many works, the agent is also assumed to observe the current environmental state, in which case we talk about ''full observability'', whereas in the opposing case we talk about ''partial observability''. Sometimes the set of actions available to the agent is restricted (e.g., the agent cannot spend more money than it possesses).
 
A reinforcement learning agent interacts with its environment in discrete time steps.
At each time <math>t</math>, the agent receives an observation <math>o_t</math>, which typically includes the reward <math>r_t</math>.
It then chooses an action <math>a_t</math> from the set of actions available, which is subsequently sent to the environment.
The environment moves to a new state <math>s_{t+1}</math> and the reward <math>r_{t+1}</math> associated with the ''transition'' <math>(s_t,a_t,s_{t+1})</math> is determined.
The goal of a reinforcement learning agent is to collect as much reward as possible. The [[Software agent|agent]] can choose any action as a function of the history and it can even randomize its action selection.
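
This interaction loop can be sketched in a few lines of code. The following is an illustration only; the <code>env</code> and <code>agent</code> interfaces (<code>reset</code>, <code>step</code>, <code>act</code>, <code>observe</code>) are assumptions of the example and do not refer to any particular library.

<syntaxhighlight lang="python">
def run_episode(env, agent, max_steps=1000):
    """Run one episode of the agent-environment interaction loop (a sketch).

    The interfaces are hypothetical: env.reset() returns the first observation,
    env.step(action) returns (observation, reward, done), and agent.act /
    agent.observe choose actions and record feedback.
    """
    observation = env.reset()                        # o_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation)              # choose a_t (possibly at random)
        observation, reward, done = env.step(action) # environment makes the transition
        agent.observe(observation, reward)           # agent sees o_{t+1} and r_{t+1}
        total_reward += reward
        if done:                                     # a terminal state was reached
            break
    return total_reward
</syntaxhighlight>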
 
When the agent's performance is compared to that of an agent which acts optimally from the beginning, the difference in performance gives rise to the notion of ''regret''.
Note that in order to act near optimally, the agent must reason about the long-term consequences of its actions: for example, in order to maximize future income, it may be better to go to school now, although the immediate monetary reward associated with this might be negative.
 
Thus, reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including [[robot control]], elevator scheduling, [[telecommunications]], [[backgammon]] and [[checkers]] ([[#refSutton1998|Sutton and Barto 1998]], Chapter 11).
 
Two components make reinforcement learning powerful:
The use of samples to optimize performance and the use of function approximation to deal with large environments.
Thanks to these two key components, reinforcement learning can be used in large environments in any of the following situations:
* A model of the environment is known, but an analytic solution is not available;
* Only a simulation model of the environment is given (the subject of [[simulation-based optimization]]);
* The only way to collect information about the environment is by interacting with it.
The first two of these problems could be considered planning problems (since some form of the model is available), while the last one could be considered as a genuine learning problem. However, under a reinforcement learning methodology both planning problems would be converted to [[machine learning]] problems.
 
== Exploration ==
 
The reinforcement learning problem as described requires clever exploration mechanisms. Randomly selecting actions, without reference to an estimated probability distribution, is known to give rise to very poor performance. The case of (small) finite MDPs is relatively well understood by now.
However, due to the lack of algorithms that provably scale well with the number of states (or scale to problems with infinite state spaces), in practice people resort to simple exploration methods. One such method is <math>\epsilon</math>-greedy, in which the agent chooses the action that it believes has the best long-term effect with probability <math>1-\epsilon</math> and chooses an action uniformly at random otherwise. Here, <math>0<\epsilon<1</math> is a tuning parameter, which is sometimes changed, either according to a fixed schedule (making the agent explore less as time goes by) or adaptively based on some heuristics [[#refTokic2010|(Tokic & Palm, 2011)]].
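
As an illustration, <math>\epsilon</math>-greedy action selection can be written in a few lines. This is a sketch only; the dictionary of estimated action values is an assumption of the example.

<syntaxhighlight lang="python">
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """Pick an action from a dict {action: estimated long-term value}.

    With probability 1 - epsilon the greedy (highest-valued) action is chosen,
    otherwise an action is drawn uniformly at random.
    """
    if random.random() < epsilon:
        return random.choice(list(action_values))
    return max(action_values, key=action_values.get)

# Example: three actions with estimated values for the current state.
print(epsilon_greedy({"left": 1.2, "right": 0.7, "stay": -0.3}, epsilon=0.1))
</syntaxhighlight>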
 
== Algorithms for control learning ==
 
Even if the issue of exploration is disregarded and even if the state were observable (which we assume from now on), the problem of finding out which actions are good based on past experience remains.
 
=== Criterion of optimality ===
For simplicity, assume for a moment that the problem studied is ''episodic'', an episode ending when some ''terminal state'' is reached. Assume further that, no matter what course of action the agent takes, termination is inevitable. Under some additional mild regularity conditions the expectation of the total reward is then well-defined, for ''any'' policy and any initial distribution over the states. Here, a policy refers to a mapping that assigns some probability distribution over the actions to all possible histories.
 
Given a fixed initial distribution <math>\mu</math>, we can thus assign the expected return <math>\rho^\pi</math> to policy <math>\pi</math>:
 
:<math>\rho^\pi = E[R|\pi],</math>
 
where the random variable <math>R</math> denotes the ''return'' and is defined by
 
:<math>R=\sum_{t=0}^{N-1} r_{t+1},</math>
 
where <math>r_{t+1}</math> is the reward received after the <math>t</math>-th transition, the initial state is sampled at random from <math>\mu</math> and actions are selected by policy <math>\pi</math>. Here,  <math>N</math> denotes the (random) time when a terminal state is reached, i.e., the time when the episode terminates.
 
In the case of non-episodic problems the return is often ''discounted'',
 
:<math>R=\sum_{t=0}^\infty \gamma^t r_{t+1},</math>
 
giving rise to the total expected discounted reward criterion. Here <math>0 \le \gamma \le 1</math> is the so-called ''discount factor''. Since the undiscounted return is a special case of the discounted return, from now on we will assume discounting. Although this looks innocent enough, discounting is in fact problematic if one cares about online performance, because it makes the initial time steps more important. Since a learning agent is likely to make mistakes during the first few steps after its "life" starts, no uninformed learning algorithm can achieve near-optimal performance under discounting, even if the class of environments is restricted to that of finite MDPs. (This does not mean, though, that given enough time a learning agent cannot figure out how to act near-optimally if time were restarted.)
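
For a finite reward sequence, the (discounted) return can be computed directly from the definition; the reward values below are made up for illustration.

<syntaxhighlight lang="python">
def discounted_return(rewards, gamma):
    """Return sum_{t=0}^{N-1} gamma^t * r_{t+1} for a finite list of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
print(discounted_return([0.0, 0.0, 1.0], gamma=1.0))  # 1.0 (undiscounted episodic return)
</syntaxhighlight>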
 
The problem then is to specify an algorithm that can be used to find a policy with maximum expected return.
From the theory of MDPs it is known that, without loss of generality, the search can be restricted to the set of the so-called ''stationary'' policies. A policy is called stationary if the action-distribution returned by it depends only on the last state visited (which is part of the observation history of the agent, by our simplifying assumption). In fact, the search can be further restricted to  ''deterministic'' stationary policies. A deterministic stationary policy is one which deterministically selects actions based on the current state. Since any such policy can be identified with a mapping from the set of states to the set of actions, these policies can be identified with such mappings with no loss of generality.
 
=== Brute force ===
 
The [[brute-force search|brute force]] approach entails the following two steps (a code sketch follows the list):
 
# For each possible policy, sample returns while following it
# Choose the policy with the largest expected return
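
The following sketch illustrates this procedure for a small, explicitly enumerable set of policies. The <code>sample_return</code> helper, which runs one episode under a policy and returns its return, is a hypothetical ingredient of the example.

<syntaxhighlight lang="python">
def brute_force(policies, sample_return, n_samples=100):
    """Estimate each policy's expected return by averaging sampled returns,
    then return the policy with the largest estimate (illustrative only)."""
    def estimated_return(policy):
        return sum(sample_return(policy) for _ in range(n_samples)) / n_samples
    return max(policies, key=estimated_return)
</syntaxhighlight>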
 
One problem with this is that the number of policies can be extremely large, or even infinite. Another is that the variance of the returns might be large, in which case a large number of samples will be required to accurately estimate the return of each policy.
 
These problems can be ameliorated if we assume some structure and perhaps allow samples generated from one policy to influence the estimates made for another. The two main approaches for achieving this are [[value function approaches|value function estimation]] and [[direct policy search]].
 
=== Value function approaches ===
 
Value function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policy (usually either the "current" or the optimal one).
 
These methods rely on the theory of MDPs, where optimality is defined in a sense which is stronger than the above one: A policy is called optimal if it achieves the best expected return from ''any'' initial state (i.e., initial distributions play no role in this definition). Again, one can always find an optimal policy amongst stationary policies.
 
To define optimality in a formal manner, define the value of a policy <math>\pi</math> by
 
: <math> V^{\pi} (s) = E[R|s,\pi],</math>
 
where <math>R</math> stands for the random return associated with following <math>\pi</math> from the initial state <math>s</math>.
Define <math>V^*(s)</math> as the maximum possible value of <math>V^\pi(s)</math>, where <math>\pi</math> is allowed to change:
 
: <math>V^*(s) = \sup \limits_\pi V^{\pi}(s).</math>
 
A policy which achieves these ''optimal values'' in ''each'' state is called ''optimal''. Clearly, a policy optimal in this strong sense is also optimal in the sense that it maximizes the expected return <math>\rho^\pi</math>, since <math>\rho^\pi = E[ V^\pi(S) ]</math>, where <math>S</math> is a state randomly sampled from the distribution <math>\mu</math>.
 
Although state-values suffice to define optimality, it will prove to be useful to define action-values. Given a state <math>s</math>, an action <math>a</math> and a policy <math>\pi</math>, the action-value of the pair <math>(s,a)</math> under <math>\pi</math> is defined by
 
:<math>Q^\pi(s,a) = E[R|s,a,\pi],\,</math>
 
where, now, <math>R</math> stands for the random return associated with first taking action <math>a</math> in state <math>s</math> and following <math>\pi</math>, thereafter.
 
It is well-known from the theory of MDPs that if someone gives us <math>Q</math> for an optimal policy, we can always choose optimal actions (and thus act optimally) by simply choosing the action with the highest value at each state.
The ''action-value function'' of such an optimal policy is called the ''optimal action-value function'' and is denoted by <math>Q^*</math>.
In summary, the knowledge of the optimal action-value function ''alone'' suffices to know how to act optimally.
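
In code, acting greedily with respect to a (tabular) action-value function is a simple lookup; the dictionary representation below is an illustrative choice.

<syntaxhighlight lang="python">
def greedy_action(q, state, actions):
    """Return an action with the highest action-value in the given state."""
    return max(actions, key=lambda a: q[(state, a)])

q = {("s0", "a"): 0.5, ("s0", "b"): 1.5}
print(greedy_action(q, "s0", ["a", "b"]))  # "b"
</syntaxhighlight>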
 
Assuming full knowledge of the MDP, there are two basic approaches to compute the optimal action-value function, [[value iteration]] and [[policy iteration]].
Both algorithms compute a sequence of functions <math>Q_k</math> (<math>k=0,1,2,\ldots,</math>) which converge to <math>Q^*</math>.
Computing these functions involves computing expectations over the whole state space, which is impractical for all but the smallest (finite) MDPs, never mind the case when the MDP is unknown.
In reinforcement learning methods the expectations are approximated by averaging over samples and one uses function approximation techniques to cope with the need to represent value functions over large state-action spaces.
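
For a small, fully known MDP, value iteration on the action-value function can be sketched as follows; the transition and reward tables are assumed inputs, and this is an illustration rather than an optimized implementation.

<syntaxhighlight lang="python">
def q_value_iteration(states, actions, transition, reward, gamma=0.9, iterations=100):
    """Value iteration on Q for a known finite MDP (a sketch).

    transition[s][a] is a list of (probability, next_state) pairs and
    reward[s][a] is the expected immediate reward of taking a in s.
    """
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iterations):
        q_new = {}
        for s in states:
            for a in actions:
                future = sum(p * max(q[(s2, a2)] for a2 in actions)
                             for p, s2 in transition[s][a])
                q_new[(s, a)] = reward[s][a] + gamma * future
        q = q_new
    return q
</syntaxhighlight>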
 
====Monte Carlo methods====
 
The simplest [[Monte Carlo sampling|Monte Carlo methods]] can be used in an algorithm that mimics policy iteration.
Policy iteration consists of two steps: ''policy evaluation'' and ''policy improvement''.
The Monte Carlo methods are used in the policy evaluation step.
In this step, given a stationary, deterministic policy <math>\pi</math>, the goal is to compute the function values <math>Q^\pi(s,a)</math> (or a good approximation to them) for all state-action pairs <math>(s,a)</math>.  
Assume (for simplicity) that the MDP is finite and that a table representing the action-values fits into memory.
Further, assume that the problem is episodic and that after each episode a new one starts from some random initial state.
Then, the estimate of the value of a given state-action pair <math>(s,a)</math> can be computed by simply averaging the sampled returns which originated from <math>(s,a)</math> over time.
Given enough time, this procedure can thus construct a precise estimate <math>Q</math> of the action-value function <math>Q^\pi</math>.
This finishes the description of the policy evaluation step.
In the policy improvement step, as it is done in the standard policy iteration algorithm, the next policy is obtained by computing a ''greedy'' policy with respect to <math>Q</math>: Given a state <math>s</math>, this new policy returns an action that maximizes <math>Q(s,\cdot)</math>. In practice one often avoids computing and storing the new policy, but uses [[lazy evaluation]] to defer the computation of the maximizing actions to when they are actually needed.
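
A sketch of the policy evaluation step, averaging sampled returns per state-action pair, is given below. The episode format (a list of (state, action, reward) triples generated by following <math>\pi</math>) is an assumption of the example.

<syntaxhighlight lang="python">
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0):
    """Estimate Q^pi by averaging the sampled returns that follow each
    state-action pair ("every-visit" averaging; illustrative only)."""
    total = defaultdict(float)
    count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards, accumulating the return that
        # follows each state-action pair.
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            total[(state, action)] += g
            count[(state, action)] += 1
    return {sa: total[sa] / count[sa] for sa in total}
</syntaxhighlight>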
 
A few problems with this procedure are as follows:
 
* The procedure may waste too much time on evaluating a suboptimal policy;
* It uses samples inefficiently in that a long trajectory is  used to improve the estimate only of the ''single'' state-action pair that started the trajectory;
* When the returns along the trajectories have ''high variance'', convergence will be slow;
* It works in ''episodic problems only'';
* It works in ''small, finite MDPs only''.
 
====Temporal difference methods====
 
The first issue is easily corrected by allowing the procedure to change the policy (at all, or at some states) before the values settle. However good this sounds, it may be dangerous, as it might prevent convergence. Still, most current algorithms implement this idea, giving rise to the class of ''generalized policy iteration'' algorithms. We note in passing that [[actor critic]] methods belong to this category.
 
The second issue can be corrected within the algorithm by allowing trajectories to contribute to any state-action pair in them.
This may also help to some extent with the third problem, although a better solution when returns have high variance is to use [[#refSutton84,#refSutton88|Sutton]]'s [[temporal difference]] (TD) methods, which are based on the recursive [[Bellman equations|Bellman equation]]. The computation in TD methods can be incremental (when after each transition the memory is changed and the transition is thrown away) or batch (when the transitions are collected and the estimates are then computed once based on a large number of transitions). Batch methods, a prime example of which is the [[least-squares temporal difference method]] due to [[#refBradtke1996|Bradtke and Barto (1996)]], may use the information in the samples better, whereas incremental methods are the only choice when batch methods become infeasible due to their high computational or memory complexity. In addition, there exist methods that try to unify the advantages of the two approaches. Methods based on temporal differences also overcome the second-to-last issue, the restriction to episodic problems.
 
In order to address the last issue mentioned in the previous section, ''function approximation methods'' are used.
In ''linear function approximation'' one starts with a mapping <math>\phi</math> that assigns a finite dimensional vector to each state-action pair. Then, the action values of a state-action pair <math>(s,a)</math> are obtained by linearly combining the components of <math>\phi(s,a)</math> with some ''weights'' <math>\theta</math>:
 
:<math>Q(s,a) = \sum \limits_{i=1}^d \theta_i \phi_i(s,a)</math>.
 
The algorithms then adjust the weights, instead of adjusting the values associated with the individual state-action pairs.
However, linear function approximation is not the only choice.
More recently, methods based on ideas from [[nonparametric statistics]] (which can be seen to construct their own features) have been explored.
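
A minimal sketch of the linear parameterization and of a single gradient-style weight adjustment towards a target return estimate follows; the feature values, step size, and target are made up for illustration.

<syntaxhighlight lang="python">
def q_linear(theta, phi_sa):
    """Q(s,a) = sum_i theta_i * phi_i(s,a), given the feature vector phi(s,a)."""
    return sum(t * f for t, f in zip(theta, phi_sa))

def adjust_weights(theta, phi_sa, target, alpha=0.1):
    """Move the weights so that Q(s,a) gets closer to a target value,
    instead of storing a separate value for each state-action pair."""
    error = target - q_linear(theta, phi_sa)
    return [t + alpha * error * f for t, f in zip(theta, phi_sa)]

theta = [0.0, 0.0, 0.0]          # d = 3 weights
phi_sa = [1.0, 0.5, -1.0]        # phi(s, a) for the current pair
theta = adjust_weights(theta, phi_sa, target=1.0)
print(q_linear(theta, phi_sa))   # moved towards 1.0
</syntaxhighlight>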
 
So far, the discussion was restricted to how policy iteration can be used as a basis for designing reinforcement learning algorithms. Equally importantly, value iteration can also be used as a starting point, giving rise to the [[Q-Learning]] algorithm [[#refWatkins1989|(Watkins 1989)]] and its many variants.
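
The core of tabular Q-Learning is a single update per observed transition; the sketch below uses a dictionary for <math>Q</math> and an illustrative step size <code>alpha</code>.

<syntaxhighlight lang="python">
def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    best_next = max(q.get((next_state, b), 0.0) for b in actions)
    old_value = q.get((state, action), 0.0)
    q[(state, action)] = old_value + alpha * (reward + gamma * best_next - old_value)
    return q
</syntaxhighlight>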
 
The problem with methods that use action-values is that they may need highly precise estimates of the competing action values, which can be hard to obtain when the returns are noisy. Though this problem is mitigated to some extent by temporal difference methods and by the so-called compatible function approximation method, more work remains to be done to increase generality and efficiency. Another problem specific to temporal difference methods comes from their reliance on the recursive Bellman equation. Most temporal difference methods have a so-called <math>\lambda</math> parameter <math>(0\le \lambda\le 1)</math> that allows one to continuously interpolate between Monte Carlo methods (which do not rely on the Bellman equations) and the basic temporal difference methods (which rely entirely on the Bellman equations), and which can thus be effective in palliating this issue.
 
=== Direct policy search ===
 
An alternative method to find a good policy is to search directly in (some subset of) the policy space, in which case the problem becomes an instance of [[stochastic optimization]]. The two approaches available are gradient-based and gradient-free methods.
 
Gradient-based methods (giving rise to the so-called ''policy gradient methods'') start with a mapping from a finite-dimensional (parameter) space to the space of policies: given the parameter vector <math>\theta</math>, let <math>\pi_\theta</math> denote the policy associated with <math>\theta</math>.
Define the performance function by
 
:<math>\rho(\theta) = \rho^{\pi_\theta}.</math>
 
Under mild conditions this function will be differentiable as a function of the parameter vector <math>\theta</math>.
If the gradient of <math>\rho</math> were known, one could use [[gradient descent|gradient ascent]].
Since an analytic expression for the gradient is not available, one must rely on a noisy estimate.
Such an estimate can be constructed in many ways, giving rise to algorithms like [[#refWilliams1987|Williams']] REINFORCE method (which is also known as the [[likelihood ratio method]] in the [[simulation-based optimization]] literature).
Policy gradient methods have received a lot of attention in the last couple of years (e.g., [[#refPeters2003|Peters et al. (2003)]]), but they remain an active area of research.
The issue with many of these  methods is that they may get stuck in local optima (as they are based on [[Local search (optimization)|local search]]).
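
The following sketch shows a likelihood-ratio (REINFORCE-style) update for a tabular softmax policy. The parameterization, step size, and episode format are assumptions of the example and do not reproduce Williams' exact formulation.

<syntaxhighlight lang="python">
import math

def softmax_probs(theta, state, actions):
    """Action probabilities pi_theta(a|s) from tabular preferences theta[(s, a)]."""
    prefs = [theta.get((state, a), 0.0) for a in actions]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(theta, episode, actions, alpha=0.01, gamma=1.0):
    """One episode-based policy gradient step (a sketch):
    theta <- theta + alpha * G_t * grad log pi(a_t | s_t) at every step t.
    `episode` is a list of (state, action, reward) triples."""
    returns, g = [], 0.0
    for _, _, r in reversed(episode):              # return following each step
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    for (state, action, _), g_t in zip(episode, returns):
        probs = softmax_probs(theta, state, actions)
        for a, p in zip(actions, probs):
            grad = (1.0 if a == action else 0.0) - p   # d log pi(a_t|s_t) / d theta[(s,a)]
            theta[(state, a)] = theta.get((state, a), 0.0) + alpha * g_t * grad
    return theta
</syntaxhighlight>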
 
A large class of methods  avoids relying on gradient information.
These include [[simulated annealing]], [[cross-entropy method|cross-entropy search]] or methods of [[evolutionary computation]].
Many gradient-free methods can achieve (in theory and in the limit) a global optimum.
In a number of cases they have indeed demonstrated remarkable performance.
 
The issue with policy search methods is that they may converge slowly if the information on which they act is noisy.
For example, this happens in episodic problems when the trajectories are long and the variance of the returns is large. As argued beforehand, value-function based methods that rely on temporal differences might help in this case. In recent years, several actor-critic algorithms have been proposed following this idea and have been demonstrated to perform well on various problems.
 
== Theory ==
 
The theory for small, finite MDPs is quite mature.
Both the asymptotic and finite-sample behavior of most algorithms is well-understood.
As mentioned beforehand, algorithms with provably good online performance (addressing the exploration issue) are known.
The theory of large MDPs needs more work. Efficient exploration is largely untouched (except for the case of bandit problems).
Although finite-time performance bounds have appeared for many algorithms in recent years, these bounds are expected to be rather loose, and thus more work is needed to better understand the relative advantages, as well as the limitations, of these algorithms.
For incremental algorithms, asymptotic convergence issues have been settled. Recently, new incremental, temporal-difference-based algorithms have appeared which converge under a much wider set of conditions than was previously possible (for example, when used with arbitrary smooth function approximation).
 
== Current research ==
 
Current research topics include:
adaptive methods which work with fewer (or no) parameters under a large number of conditions,
addressing the exploration problem in large MDPs,
large-scale empirical evaluations,
learning and acting under [[Partially observable Markov decision process|partial information]] (e.g., using [[Predictive State Representation]]),
modular and hierarchical reinforcement learning,
improving existing value-function and policy search methods,
algorithms that work well with large (or continuous) action spaces,
transfer learning,
lifelong learning,
efficient sample-based planning (e.g., based on [[Monte-Carlo tree search]]).
Multiagent or distributed reinforcement learning is also a topic of interest in current research.
There is also a growing interest in real life applications of reinforcement learning.
Successes of reinforcement learning are collected
[http://umichrl.pbworks.com/Successes-of-Reinforcement-Learning/ here] and
[http://rl-community.org/wiki/Successes_Of_RL here].
 
Reinforcement learning algorithms such as TD learning are also being investigated as a model for [[Dopamine]]-based learning in the brain. In this model, the dopaminergic projections from the substantia nigra to the basal ganglia function as the prediction error. Reinforcement learning has also been used as part of the model for human skill learning, especially in relation to the interaction between implicit and explicit learning in skill acquisition (the first publication on this application was in 1995–1996, and there have been many follow-up studies). See http://webdocs.cs.ualberta.ca/~sutton/RL-FAQ.html#behaviorism for further details of these research areas.
 
== Literature ==
 
=== Conferences, journals ===
Most reinforcement learning papers are published at the major machine learning and AI conferences (ICML, NIPS, AAAI, IJCAI, UAI, AI and Statistics) and journals ([http://www.jair.org JAIR], [http://www.jmlr.org JMLR], [http://www.springer.com/computer/ai/journal/10994 Machine Learning journal]). Some theory papers are published at COLT and ALT. However, many papers appear in robotics conferences (IROS, ICRA) and the "agent" conference AAMAS. Operations researchers publish their papers at the INFORMS conference and, for example, in the [http://or.pubs.informs.org Operations Research] and [http://mor.pubs.informs.org Mathematics of Operations Research] journals. Control researchers publish their papers at the CDC and ACC conferences, or, e.g., in the journals [http://www.nd.edu/~ieeetac/ IEEE Transactions on Automatic Control] and [http://www.elsevier.com/locate/automatica Automatica], although applied works tend to be published in more specialized journals. The [http://www.wintersim.org/ Winter Simulation Conference] also publishes many relevant papers. Beyond these, papers are also published in the major conferences of the neural networks, fuzzy systems, and evolutionary computation communities. The annual IEEE symposium titled Approximate Dynamic Programming and Reinforcement Learning (ADPRL) and the biannual European Workshop on Reinforcement Learning ([http://ewrl.wordpress.com/ EWRL]) are two regularly held meetings where RL researchers meet.
 
== See also ==
* [[Temporal difference learning]]
* [[Q-learning]]
* [[SARSA]]
* [[Fictitious play]]
* [[Learning classifier system]]
* [[Optimal control]]
* [[Dynamic treatment regimes]]
* [[Error-driven learning]]
 
==Implementations==
* [http://glue.rl-community.org/ RL-Glue] provides a standard interface that allows you to connect agents, environments, and experiment programs together, even if they are written in different languages.
* [http://mmlf.sourceforge.net/ Maja Machine Learning Framework] The Maja Machine Learning Framework (MMLF) is a general framework for problems in the domain of reinforcement learning (RL), written in Python.
* [http://jamh-web.appspot.com/download.htm Software Tools for Reinforcement Learning (Matlab and Python)]
* [http://www.pybrain.org/ PyBrain(Python)]
* [http://servicerobotik.hs-weingarten.de/en/teachingbox.php TeachingBox] is a Java reinforcement learning framework supporting many features like RBF networks, gradient descent learning methods, ...
* [http://homepages.cwi.nl/%7Ehasselt/code.html C++ implementation] for some well known reinforcement learning algorithms with source.
* [[Orange (software)|Orange]], a free data mining software suite, module [http://www.ailab.si/orange/doc/modules/orngReinforcement.htm orngReinforcement]
* [http://www.ias.informatik.tu-darmstadt.de/Research/PolicyGradientToolbox Policy Gradient Toolbox] provides a package for learning about policy gradient approaches.
 
== References ==
 
* {{Cite thesis
  | last = Sutton | first = Richard S. | authorlink = Richard S. Sutton
  | degree= PhD
  | title= Temporal Credit Assignment in Reinforcement Learning
  | year= 1984
  | school = University of Massachusetts, Amherst, MA
  | url= http://webdocs.cs.ualberta.ca/~sutton/publications.html#PhDthesis}}
 
* {{cite conference
  | last = Williams | first = Ronald J. | authorlink = Ronald J. Williams
  | title = A class of gradient-estimating algorithms for reinforcement learning in neural networks
  | booktitle = Proceedings of the IEEE First International Conference on Neural Networks
  | year = 1987
  | url = http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.129.8871}}
 
* {{cite journal
  | doi = 10.1007/BF00115009
  | last = Sutton | first = Richard S. | authorlink = Richard S. Sutton
  | title = Learning to predict by the method of temporal differences
  | journal = Machine Learning
  | volume = 3
  | pages = 9&ndash;44
  | publisher = Springer
  | year = 1988
  | url = http://webdocs.cs.ualberta.ca/~sutton/publications.html#TD_paper}}
 
* {{Cite thesis
  | last = Watkins | first = Christopher J.C.H. | authorlink = Christopher J.C.H.  Watkins
  | degree= PhD
  | title= Learning from Delayed Rewards
  | year= 1989
  | school = King’s College, Cambridge, UK
  | url= http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf}}
 
* {{cite journal
  | doi = 10.1023/A:1018056104778
  | last = Bradtke | first = Steven J. | authorlink = Steven J. Bradtke
  | coauthors = [[Andrew G. Barto]]
  | title = Linear least-squares algorithms for temporal difference learning
  | journal = Machine Learning
  | volume = 22
  | pages = 33&ndash;57
  | publisher = Springer
  | year = 1996
  | url = http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.143.857}}
 
* {{cite book
  | last = Bertsekas | first = Dimitri P. | authorlink = Dimitri P. Bertsekas
  | coauthors = [[John Tsitsiklis]]
  | title = Neuro-Dynamic Programming
  | publisher = Athena Scientific
  | year = 1996
  | location = Nashua, NH
  | isbn = 1-886529-10-8
  | url = http://www.athenasc.com/ndpbook.html}}
 
* {{cite journal
  | last = Kaelbling | first = Leslie P. | authorlink = Leslie P. Kaelbling
  | coauthors = [[Michael L. Littman]]; [[Andrew W. Moore]]
  | title = Reinforcement Learning: A Survey
  | journal = Journal of Artificial Intelligence Research
  | volume = 4
  | pages = 237&ndash;285
  | publisher =
  | year = 1996
  | url = http://www.cs.washington.edu/research/jair/abstracts/kaelbling96a.html}}
 
* {{cite book
  | last = Sutton | first = Richard S. | authorlink = Richard S. Sutton
  | coauthors = [[Barto, Andrew G.]]
  | title = Reinforcement Learning: An Introduction
  | publisher = MIT Press
  | year = 1998
  | isbn = 0-262-19398-1
  | pages =
  | url = http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html}}
 
* {{cite conference
  | last = Peters | first = Jan | authorlink = Jan Peters (researcher)
  | coauthors = [[Sethu Vijayakumar]]; [[Stefan Schaal]]
  | title = Reinforcement Learning for Humanoid Robotics
  | booktitle = IEEE-RAS International Conference on Humanoid Robots
  | year = 2003
  | url = http://www-clmc.usc.edu/publications/p/peters-ICHR2003.pdf}}
 
* {{cite book
  | last = Powell | first = Warren
  | title = Approximate dynamic programming: solving the curses of dimensionality
  | year = 2007
  | publisher = Wiley-Interscience
  | isbn = 0-470-17155-3
  | url = http://www.castlelab.princeton.edu/adp.htm}}
 
* {{cite journal
  | last = Auer | first = Peter | authorlink = Peter Auer
  | coauthors = [[Thomas Jaksch]]; [[Ronald Ortner]]
  | title = Near-optimal regret bounds for reinforcement learning
  | journal = Journal of Machine Learning Research
  | volume = 11
  | pages = 1563&ndash;1600
  | publisher =
  | year = 2010
  | url = http://jmlr.csail.mit.edu/papers/v11/jaksch10a.html}}
 
* {{cite conference
  | last = Szita | first = Istvan | authorlink = Istvan Szita
  | title = Model-based Reinforcement Learning with Nearly Tight Exploration Complexity Bounds
  | coauthors = [[Csaba Szepesvari]]
  | year = 2010
  | publisher = Omnipress
  | booktitle = ICML 2010
  | pages = 1031–1038
  | url = http://www.icml2010.org/papers/546.pdf}}
 
* {{Cite book
  | last = Bertsekas | first = Dimitri P. | authorlink = Dimitri P. Bertsekas
  | title= Dynamic Programming and Optimal Control
  |date=August 2010
  | edition = 3
  | volume = II
  | chapter =  Chapter 6 (online): Approximate Dynamic Programming
  | url= http://web.mit.edu/dimitrib/www/dpchapter.pdf}}
 
* {{cite book
  | last = Busoniu | first = Lucian | authorlink = Lucian Busoniu
  | coauthors = [[Robert Babuska]] ; [[Bart De Schutter]] ; [[Damien Ernst]]
  | title = Reinforcement Learning and Dynamic Programming using Function Approximators 
  | publisher = Taylor & Francis CRC Press
  | year = 2010
  | isbn = 978-1-4398-2108-4
  | pages =
  | url = http://www.dcsc.tudelft.nl/rlbook/}}
 
*  {{Cite book
| publisher = Springer Berlin / Heidelberg
| volume = 7006
| pages = 335–346
| last = Tokic | first = Michel | authorlink = Michel Tokic
| coauthors = [[Günther Palm]] ;
| title = KI 2011: Advances in Artificial Intelligence
| chapter = Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax
| series = Lecture Notes in Computer Science
| url = http://www.tokic.com/www/tokicm/publikationen/papers/KI2011.pdf
| year = 2011
}}
 
== External links ==
* [http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html Website for ''Reinforcement Learning: An Introduction''] (1998), by Rich Sutton and Andrew Barto, MIT Press, including a link to an html version of the book.
* [http://www-anw.cs.umass.edu/rlr/ Reinforcement Learning Repository]
* [http://spaces.facsci.ualberta.ca/rlai/ Reinforcement Learning and Artificial Intelligence] (RLAI, Rich Sutton's lab at the University of Alberta)
* [http://www-all.cs.umass.edu/ Autonomous Learning Laboratory] (ALL, Andrew Barto's lab at the University of Massachusetts Amherst)
* [http://glue.rl-community.org RL-Glue]
* [http://jamh-web.appspot.com/download.htm Software Tools for Reinforcement Learning (Matlab and Python)]
* [http://www.igi.tugraz.at/ril-toolbox The Reinforcement Learning Toolbox from the (Graz University of Technology) ]
* [http://www.cogsci.rpi.edu/~rsun/hybrid-rl.html Hybrid reinforcement learning]
* [http://sourceforge.net/projects/piqle/  Piqle: a Generic Java Platform for Reinforcement Learning]
* [http://homepages.cwi.nl/~hasselt/rl_algs/rl_algs.html A Short Introduction To Some Reinforcement Learning Algorithms]
* [http://www.lwebzem.com/cgi-bin/ttt/ttt.html Reinforcement Learning applied to Tic-Tac-Toe Game]
* [http://www.scholarpedia.org/article/Reinforcement_Learning Scholarpedia Reinforcement Learning]
* [http://www.scholarpedia.org/article/Temporal_difference_learning Scholarpedia Temporal Difference Learning]
* [http://www.troovoo.com/vid.php?a=Stanford&c=Machine+Learning&l=Applications+of+Reinforcement+Learning Stanford Reinforcement Learning Course]
* [http://www.dcsc.tudelft.nl/~robotics/media.html Real-world reinforcement learning experiments] at [[Delft University of Technology]]
* [http://busoniu.net/repository.php Reinforcement Learning Tools for Matlab]
* [https://www.youtube.com/watch?v=RtxI449ZjSc&feature=relmfu Stanford University Andrew Ng Lecture on Reinforcement Learning]
* [http://www.ias.informatik.tu-darmstadt.de/uploads/Team/JanPeters/Peters2010_REPS.pdf Relative Entropy Policy Search]
 
[[Category:Markov models]]
[[Category:Machine learning algorithms]]
[[Category:Belief revision]]

Revision as of 02:42, 21 January 2014

28 year-old Painting Investments Worker Truman from Regina, usually spends time with pastimes for instance interior design, property developers in new launch ec Singapore and writing. Last month just traveled to City of the Renaissance. Genital herpes is a kind of sexually transmitted disease that certain becomes through sexual or oral connection with someone else that is afflicted by the viral disorder. Oral herpes requires occasional eruptions of fever blisters" round the mouth Figure 02 Also known as cold sores" or fever blisters," characteristic herpes lesions often appear around the mouth sometimes of illness, after sunlight or wind publicity, during menstruation, or with mental stress.

Though statistical numbers aren't nearly where they should be, increasing numbers of people are arriving at various clinics regarding the herpes symptoms also to have themselves and their companions treated.

Because symptoms may be recognised incorrectly as skin irritation or something else, a partner can't be determined by the partner with herpes to constantly find out when he or she is contagious. Some who contract herpes are symptom-no cost, others have just one breakout, and still others have standard bouts of symptoms.

Similarly, careful hand washing should be practiced to avoid the virus from spreading to other parts of the body, especially the eye and mouth. If you think you have already been exposed or show signs of herpes infection, see your medical provider. Prompt qualified diagnosis may boost your chances of responding to a prescription drugs like acyclovir that decreases the duration and severity of a short bout of symptoms.

HSV type 1 is the herpes virus that is usually responsible for cold sores of the mouth, the so-referred to as " fever blisters." You get HSV-1 by coming into contact with the saliva of an contaminated person.

If you are you looking for more information regarding herpes symptoms oral pictures look into our own web page.

Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, statistics, and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called approximate dynamic programming. The problem has been studied in the theory of optimal control, though most studies there are concerned with existence of optimal solutions and their characterization, and not with the learning or approximation aspects. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

In machine learning, the environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms for this context are highly related to dynamic programming techniques. The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP and they target large MDPs where exact methods become infeasible.

Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.

Introduction

The basic reinforcement learning model consists of:

  1. a set of environment states S;
  2. a set of actions A;
  3. rules of transitioning between states;
  4. rules that determine the scalar immediate reward of a transition; and
  5. rules that describe what the agent observes.

The rules are often stochastic. The observation typically involves the scalar immediate reward associated with the last transition. In many works, the agent is also assumed to observe the current environmental state, in which case we talk about full observability, whereas in the opposing case we talk about partial observability. Sometimes the set of actions available to the agent is restricted (e.g., you cannot spend more money than what you possess).

A reinforcement learning agent interacts with its environment in discrete time steps. At each time t, the agent receives an observation ot, which typically includes the reward rt. It then chooses an action at from the set of actions available, which is subsequently sent to the environment. The environment moves to a new state st+1 and the reward rt+1 associated with the transition (st,at,st+1) is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can choose any action as a function of the history and it can even randomize its action selection.

When the agent's performance is compared to that of an agent which acts optimally from the beginning, the difference in performance gives rise to the notion of regret. Note that in order to act near optimally, the agent must reason about the long term consequences of its actions: In order to maximize my future income I had better go to school now, although the immediate monetary reward associated with this might be negative.

Thus, reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon and checkers (Sutton and Barto 1998, Chapter 11).

Two components make reinforcement learning powerful: The use of samples to optimize performance and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in any of the following situations:

  • A model of the environment is known, but an analytic solution is not available;
  • Only a simulation model of the environment is given (the subject of simulation-based optimization);
  • The only way to collect information about the environment is by interacting with it.

The first two of these problems could be considered planning problems (since some form of the model is available), while the last one could be considered as a genuine learning problem. However, under a reinforcement learning methodology both planning problems would be converted to machine learning problems.

Exploration

The reinforcement learning problem as described requires clever exploration mechanisms. Randomly selecting actions, without reference to an estimated probability distribution, is known to give rise to very poor performance. The case of (small) finite MDPs is relatively well understood by now. However, due to the lack of algorithms that would provably scale well with the number of states (or scale to problems with infinite state spaces), in practice people resort to simple exploration methods. One such method is ϵ-greedy, when the agent chooses the action that it believes has the best long-term effect with probability 1ϵ, and it chooses an action uniformly at random, otherwise. Here, 0<ϵ<1 is a tuning parameter, which is sometimes changed, either according to a fixed schedule (making the agent explore less as time goes by), or adaptively based on some heuristics (Tokic & Palm, 2011).

Algorithms for control learning

Even if the issue of exploration is disregarded and even if the state was observable (which we assume from now on), the problem remains to find out which actions are good based on past experience.

Criterion of optimality

For simplicity, assume for a moment that the problem studied is episodic, an episode ending when some terminal state is reached. Assume further that no matter what course of actions the agent takes, termination is inevitable. Under some additional mild regularity conditions the expectation of the total reward is then well-defined, for any policy and any initial distribution over the states. Here, a policy refers to a mapping that assigns some probability distribution over the actions to all possible histories.

Given a fixed initial distribution μ, we can thus assign the expected return ρπ to policy π:

ρπ=E[R|π],

where the random variable R denotes the return and is defined by

R=t=0N1rt+1,

where rt+1 is the reward received after the t-th transition, the initial state is sampled at random from μ and actions are selected by policy π. Here, N denotes the (random) time when a terminal state is reached, i.e., the time when the episode terminates.

In the case of non-episodic problems the return is often discounted,

R=t=0γtrt+1,

giving rise to the total expected discounted reward criterion. Here 0γ1 is the so-called discount-factor. Since the undiscounted return is a special case of the discounted return, from now on we will assume discounting. Although this looks innocent enough, discounting is in fact problematic if one cares about online performance. This is because discounting makes the initial time steps more important. Since a learning agent is likely to make mistakes during the first few steps after its "life" starts, no uninformed learning algorithm can achieve near-optimal performance under discounting even if the class of environments is restricted to that of finite MDPs. (This does not mean though that, given enough time, a learning agent cannot figure how to act near-optimally, if time was restarted.)

The problem then is to specify an algorithm that can be used to find a policy with maximum expected return. From the theory of MDPs it is known that, without loss of generality, the search can be restricted to the set of the so-called stationary policies. A policy is called stationary if the action-distribution returned by it depends only on the last state visited (which is part of the observation history of the agent, by our simplifying assumption). In fact, the search can be further restricted to deterministic stationary policies. A deterministic stationary policy is one which deterministically selects actions based on the current state. Since any such policy can be identified with a mapping from the set of states to the set of actions, these policies can be identified with such mappings with no loss of generality.

Brute force

The brute force approach entails the following two steps:

  1. For each possible policy, sample returns while following it
  2. Choose the policy with the largest expected return

One problem with this is that the number of policies can be extremely large, or even infinite. Another is that variance of the returns might be large, in which case a large number of samples will be required to accurately estimate the return of each policy.

These problems can be ameliorated if we assume some structure and perhaps allow samples generated from one policy to influence the estimates made for another. The two main approaches for achieving this are value function estimation and direct policy search.

Value function approaches

Value function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policy (usually either the "current" or the optimal one).

These methods rely on the theory of MDPs, where optimality is defined in a sense which is stronger than the above one: A policy is called optimal if it achieves the best expected return from any initial state (i.e., initial distributions play no role in this definition). Again, one can always find an optimal policy amongst stationary policies.

To define optimality in a formal manner, define the value of a policy π by

Vπ(s)=E[R|s,π],

where R stands for the random return associated with following π from the initial state s. Define V*(s) as the maximum possible value of Vπ(s), where π is allowed to change:

V*(s)=sup\limits πVπ(s).

A policy which achieves these optimal values in each state is called optimal. Clearly, a policy optimal in this strong sense is also optimal in the sense that it maximizes the expected return ρπ, since ρπ=E[Vπ(S)], where S is a state randomly sampled from the distribution μ.

Although state-values suffice to define optimality, it will prove to be useful to define action-values. Given a state s, an action a and a policy π, the action-value of the pair (s,a) under π is defined by

Qπ(s,a)=E[R|s,a,π],

where, now, R stands for the random return associated with first taking action a in state s and following π, thereafter.

It is well-known from the theory of MDPs that if someone gives us Q for an optimal policy, we can always choose optimal actions (and thus act optimally) by simply choosing the action with the highest value at each state. The action-value function of such an optimal policy is called the optimal action-value function and is denoted by Q*. In summary, the knowledge of the optimal action-value function alone suffices to know how to act optimally.

Assuming full knowledge of the MDP, there are two basic approaches to compute the optimal action-value function, value iteration and policy iteration. Both algorithms compute a sequence of functions Qk (k=0,1,2,,) which converge to Q*. Computing these functions involves computing expectations over the whole state-space, which is impractical for all, but the smallest (finite) MDPs, never mind the case when the MDP is unknown. In reinforcement learning methods the expectations are approximated by averaging over samples and one uses function approximation techniques to cope with the need to represent value functions over large state-action spaces.

Monte Carlo methods

The simplest Monte Carlo methods can be used in an algorithm that mimics policy iteration. Policy iteration consists of two steps: policy evaluation and policy improvement. The Monte Carlo methods are used in the policy evaluation step. In this step, given a stationary, deterministic policy π, the goal is to compute the function values Qπ(s,a) (or a good approximation to them) for all state-action pairs (s,a). Assume (for simplicity) that the MDP is finite and in fact a table representing the action-values fits into the memory. Further, assume that the problem is episodic and after each episode a new one starts from some random initial state. Then, the estimate of the value of a given state-action pair (s,a)can be computed by simply averaging the sampled returns which originated from (s,a) over time. Given enough time, this procedure can thus construct a precise estimate Q of the action-value function Qπ. This finishes the description of the policy evaluation step. In the policy improvement step, as it is done in the standard policy iteration algorithm, the next policy is obtained by computing a greedy policy with respect to Q: Given a state s, this new policy returns an action that maximizes Q(s,). In practice one often avoids computing and storing the new policy, but uses lazy evaluation to defer the computation of the maximizing actions to when they are actually needed.

A few problems with this procedure are as follows:

  • The procedure may waste too much time on evaluating a suboptimal policy;
  • It uses samples inefficiently in that a long trajectory is used to improve the estimate only of the single state-action pair that started the trajectory;
  • When the returns along the trajectories have high variance, convergence will be slow;
  • It works in episodic problems only;
  • It works in small, finite MDPs only.

Temporal difference methods

The first issue is easily corrected by allowing the procedure to change the policy (at all, or at some states) before the values settle. However good this sounds, this may be dangerous as this might prevent convergence. Still, most current algorithms implement this idea, giving rise to the class of generalized policy iteration algorithm. We note in passing that actor critic methods belong to this category.

The second issue can be corrected within the algorithm by allowing trajectories to contribute to any state-action pair in them. This may also help to some extent with the third problem, although a better solution when returns have high variance is to use Sutton's temporal difference (TD) methods which are based on the recursive Bellman equation. Note that the computation in TD methods can be incremental (when after each transition the memory is changed and the transition is thrown away), or batch (when the transitions are collected and then the estimates are computed once based on a large number of transitions). Batch methods, a prime example of which is the least-squares temporal difference method due to Bradtke and Barto (1996), may use the information in the samples better, whereas incremental methods are the only choice when batch methods become infeasible due to their high computational or memory complexity. In addition, there exist methods that try to unify the advantages of the two approaches. Methods based on temporal differences also overcome the second but last issue.

In order to address the last issue mentioned in the previous section, function approximation methods are used. In linear function approximation one starts with a mapping ϕ that assigns a finite dimensional vector to each state-action pair. Then, the action values of a state-action pair (s,a) are obtained by linearly combining the components of ϕ(s,a) with some weights θ:

Q(s,a)=i=1dθiϕi(s,a).

The algorithms then adjust the weights, instead of adjusting the values associated with the individual state-action pairs. However, linear function approximation is not the only choice. More recently, methods based on ideas from nonparametric statistics (which can be seen to construct their own features) have been explored.

So far, the discussion was restricted to how policy iteration can be used as a basis of the designing reinforcement learning algorithms. Equally importantly, value iteration can also be used as a starting point, giving rise to the Q-Learning algorithm (Watkins 1989) and its many variants.

The problem with methods that use action-values is that they may need highly precise estimates of the competing action values, which can be hard to obtain when the returns are noisy. Though this problem is mitigated to some extent by temporal difference methods and if one uses the so-called compatible function approximation method, more work remains to be done to increase generality and efficiency. Another problem specific to temporal difference methods comes from their reliance on the recursive Bellman equation. Most temporal difference methods have a so-called λ parameter (0λ1) that allows one to continuously interpolate between Monte-Carlo methods (which do not rely on the Bellman equations) and the basic temporal difference methods (which rely entirely on the Bellman equations), which can thus be effective in palliating this issue.

Direct policy search

An alternative method to find a good policy is to search directly in (some subset) of the policy space, in which case the problem becomes an instance of stochastic optimization. The two approaches available are gradient-based and gradient-free methods.

Gradient-based methods (giving rise to the so-called policy gradient methods) start with a mapping from a finite dimensional (parameter) space to the space of policies: given the parameter vector θ, let πθ denote the policy associated to θ. Define the performance function by

ρ(θ)=ρπθ.

Under mild conditions this function will be differentiable as a function of the parameter vector θ. If the gradient of ρ was known, one could use gradient ascent. Since an analytic expression for the gradient is not available, one must rely on a noisy estimate. Such an estimate can be constructed in many ways, giving rise to algorithms like Williams' REINFORCE method (which is also known as the likelihood ratio method in the simulation-based optimization literature). Policy gradient methods have received a lot of attention in the last couple of years (e.g., Peters et al. (2003)), but they remain an active field. The issue with many of these methods is that they may get stuck in local optima (as they are based on local search).

A large class of methods avoids relying on gradient information. These include simulated annealing, cross-entropy search or methods of evolutionary computation. Many gradient-free methods can achieve (in theory and in the limit) a global optimum. In a number of cases they have indeed demonstrated remarkable performance.

The issue with policy search methods is that they may converge slowly if the information based on which they act is noisy. For example, this happens when in episodic problems the trajectories are long and the variance of the returns is large. As argued beforehand, value-function based methods that rely on temporal differences might help in this case. In recent years, several actor-critic algorithms have been proposed following this idea and were demonstrated to perform well in various problems.

Theory

The theory for small, finite MDPs is quite mature. Both the asymptotic and finite-sample behavior of most algorithms is well understood. As mentioned above, algorithms with provably good online performance (addressing the exploration issue) are known. The theory of large MDPs needs more work. Efficient exploration is largely untouched (except for the case of bandit problems). Although finite-time performance bounds have appeared for many algorithms in recent years, these bounds are expected to be rather loose, and thus more work is needed to better understand the relative advantages and limitations of these algorithms. For incremental algorithms, asymptotic convergence issues have been settled. Recently, new incremental, temporal-difference-based algorithms have appeared which converge under a much wider set of conditions than was previously possible (for example, when used with arbitrary, smooth function approximation).

Current research

Current research topics include: adaptive methods which work with fewer (or no) parameters under a large number of conditions, addressing the exploration problem in large MDPs, large-scale empirical evaluations, learning and acting under partial information (e.g., using predictive state representations), modular and hierarchical reinforcement learning, improving existing value-function and policy search methods, algorithms that work well with large (or continuous) action spaces, transfer learning, lifelong learning, and efficient sample-based planning (e.g., based on Monte-Carlo tree search). Multiagent or distributed reinforcement learning is also a topic of interest in current research, and there is growing interest in real-life applications of reinforcement learning.

Reinforcement learning algorithms such as TD learning are also being investigated as a model for dopamine-based learning in the brain. In this model, the dopaminergic projections from the substantia nigra to the basal ganglia function as the prediction error. Reinforcement learning has also been used as part of a model of human skill learning, especially in relation to the interaction between implicit and explicit learning in skill acquisition (the first publication on this application was in 1995-1996, and there have been many follow-up studies). See http://webdocs.cs.ualberta.ca/~sutton/RL-FAQ.html#behaviorism for further details on these research areas.

Literature

Conferences, journals

Most reinforcement learning papers are published at the major machine learning and AI conferences (ICML, NIPS, AAAI, IJCAI, UAI, AI and Statistics) and in journals (JAIR, JMLR, the Machine Learning journal). Some theory papers are published at COLT and ALT. However, many papers appear at robotics conferences (IROS, ICRA) and at the "agents" conference AAMAS. Operations researchers publish their papers at the INFORMS conference and, for example, in the Operations Research and Mathematics of Operations Research journals. Control researchers publish their papers at the CDC and ACC conferences or, e.g., in the journals IEEE Transactions on Automatic Control and Automatica, although applied work tends to be published in more specialized journals. The Winter Simulation Conference also publishes many relevant papers. Beyond these, papers are also published at the major conferences of the neural networks, fuzzy systems, and evolutionary computation communities. The annual IEEE symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL) and the biannual European Workshop on Reinforcement Learning (EWRL) are two regularly held meetings where RL researchers meet.

See also

Implementations

References

External links