A '''partially observable Markov decision process''' (POMDP) is a generalization of a [[Markov decision process]] (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP.

The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The framework originated in the [[operations research]] community and was later taken up by the [[artificial intelligence]] and [[automated planning]] communities.

An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.
==Definition==

===Formal definition===

A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a tuple <math>(S,A,O,T,\Omega,R)</math>, where

* <math>S</math> is a set of states,
* <math>A</math> is a set of actions,
* <math>O</math> is a set of observations,
* <math>T</math> is a set of conditional transition probabilities <math>T(s'\mid s,a)</math> between states,
* <math>\Omega</math> is a set of conditional observation probabilities <math>\Omega(o\mid s',a)</math>,
* <math>R: S \times A \to \mathbb{R}</math> is the reward function.

At each time period, the environment is in some state <math>s \in S</math>. The agent takes an action <math>a \in A</math>, which causes the environment to transition to state <math>s'</math> with probability <math>T(s'\mid s,a)</math>. At the same time, the agent receives an observation <math>o \in O</math> that depends on the new state of the environment with probability <math>\Omega(o\mid s',a)</math>. Finally, the agent receives a reward with expected value <math>R(s,a)</math>, and the process repeats.
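For concreteness, a finite POMDP can be stored as three arrays, one each for <math>T</math>, <math>\Omega</math> and <math>R</math>. The following Python sketch is purely illustrative: the class name, the array layout and the toy numbers are assumptions made for this example, not part of any standard library.

<syntaxhighlight lang="python">
import numpy as np

class TabularPOMDP:
    """Minimal container for a finite POMDP (S, A, O, T, Omega, R).

    T[a, s, s']     = T(s' | s, a)      -- transition probabilities
    Omega[a, s', o] = Omega(o | s', a)  -- observation probabilities
    R[s, a]         = reward function
    """
    def __init__(self, T, Omega, R):
        self.T = np.asarray(T, dtype=float)          # shape (|A|, |S|, |S|)
        self.Omega = np.asarray(Omega, dtype=float)  # shape (|A|, |S|, |O|)
        self.R = np.asarray(R, dtype=float)          # shape (|S|, |A|)
        self.n_actions, self.n_states, _ = self.T.shape
        self.n_observations = self.Omega.shape[-1]

# A toy instance with two hidden states, two actions and two observations.
model = TabularPOMDP(
    T=[[[0.9, 0.1], [0.2, 0.8]],      # action 0
       [[0.5, 0.5], [0.5, 0.5]]],     # action 1
    Omega=[[[0.8, 0.2], [0.3, 0.7]],  # action 0
           [[0.5, 0.5], [0.5, 0.5]]], # action 1
    R=[[1.0, 0.0],                    # R(s, a)
       [0.0, 2.0]])
</syntaxhighlight>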
===Discussion===

The difficulty is that the agent does not know the exact state of the environment. Instead, it must maintain a probability distribution, known as the belief state, over the possible states <math>S</math>, and update it as it acts and observes.

It is instructive to compare the above definition with the definition of a [[Markov decision process#Definition|Markov decision process]]. An MDP does not include the observation set or the observation probabilities, because the agent always knows the current state of the environment with certainty; in addition, the reward function of an MDP is written as a function of two successive states rather than of a state and an action.
==Belief update==

An agent needs to update its belief upon taking the action <math>a</math> and observing <math>o</math>. Since the state is Markovian, maintaining a belief over the states requires only knowledge of the previous belief state, the action taken, and the current observation. The operation is denoted <math>b' = \tau(b,a,o)</math>. Below we describe how this belief update is computed.

After reaching <math>s'</math>, the agent observes <math>o \in O</math> with probability <math>\Omega(o\mid s',a)</math>. Let <math>b</math> be a probability distribution over the state space <math>S</math>: <math>b(s)</math> denotes the probability that the environment is in state <math>s</math>. Given <math>b(s)</math>, then after taking action <math>a</math> and observing <math>o</math>,

:<math>
b'(s') = \eta\, \Omega(o\mid s',a) \sum_{s\in S} T(s'\mid s,a)b(s),
</math>

where <math>\eta=1/P(o\mid b,a)</math> is a normalizing constant with <math>P(o\mid b,a) = \sum_{s'\in S}\Omega(o\mid s',a)\sum_{s\in S}T(s'\mid s,a)b(s)</math>.
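The belief update can be computed directly from the formula above: a prediction step using <math>T</math>, a correction step using <math>\Omega</math>, and a renormalization by <math>P(o\mid b,a)</math>. The following Python sketch is illustrative only; the argument names and array layout are assumptions made for this example.

<syntaxhighlight lang="python">
import numpy as np

def belief_update(T_a, Omega_a, b, o):
    """Compute the updated belief b' = tau(b, a, o) for the action a taken.

    T_a     : (|S|, |S|) array, T_a[s, s'] = T(s' | s, a)
    Omega_a : (|S|, |O|) array, Omega_a[s', o] = Omega(o | s', a)
    b       : (|S|,) array, the current belief b(s)
    o       : index of the observation received
    """
    predicted = T_a.T @ b                       # sum_s T(s'|s,a) b(s)
    unnormalized = Omega_a[:, o] * predicted    # times Omega(o|s',a)
    p_o = unnormalized.sum()                    # P(o | b, a)
    if p_o == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / p_o                   # eta = 1 / P(o | b, a)

# Two-state example: update a uniform belief after taking an action and
# observing observation 1.
T_a = np.array([[0.9, 0.1],
                [0.2, 0.8]])
Omega_a = np.array([[0.8, 0.2],
                    [0.3, 0.7]])
print(belief_update(T_a, Omega_a, b=np.array([0.5, 0.5]), o=1))
</syntaxhighlight>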
==Belief MDP==

The policy maps the belief space into the action space. The optimal policy can be understood as the solution of a continuous-state [[Markov decision process]]<ref>{{cite journal |doi=10.1016/S0004-3702(98)00023-X |author=Kaelbling, L.P., Littman, M.L., Cassandra, A.R. |title=Planning and acting in partially observable stochastic domains |journal=Artificial Intelligence Journal |volume=101 |pages=99–134 |year=1998 }}</ref> (the so-called belief MDP). It is defined as a tuple <math>(B,A,\tau,r)</math>, where

* <math>B</math> is the set of belief states over the POMDP states,
* <math>A</math> is the same set of actions as for the original POMDP,
* <math>\tau</math> is the belief state transition function, induced by the belief update <math>\tau(b,a,o)</math> and the observation probabilities <math>P(o\mid b,a)</math> above,
* <math>r:B \times A \to \mathbb{R}</math> is the reward function on belief states, defined as <math>r(b,a) = \sum_{s\in S} b(s) R(s,a)</math>.

Note that this MDP is defined over a continuous state space, since there are infinitely many belief distributions over <math>S</math>.
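In this belief MDP, the reward <math>r(b,a)</math> is the expectation of <math>R</math> under <math>b</math>, and the transition assigns probability <math>P(o\mid b,a)</math> to the successor belief <math>\tau(b,a,o)</math> for each observation <math>o</math>. The sketch below is a minimal illustration under the same assumed array layout as the earlier examples; the function names are not from any standard library.

<syntaxhighlight lang="python">
import numpy as np

def belief_reward(b, R, a):
    """r(b, a) = sum_s b(s) R(s, a)."""
    return b @ R[:, a]

def belief_successors(b, a, T, Omega):
    """Belief-MDP transition for action a: a list of pairs
    (P(o|b,a), tau(b,a,o)), one per observation with nonzero probability."""
    predicted = T[a].T @ b                      # sum_s T(s'|s,a) b(s)
    successors = []
    for o in range(Omega.shape[2]):
        unnormalized = Omega[a][:, o] * predicted
        p_o = unnormalized.sum()                # P(o | b, a)
        if p_o > 0.0:
            successors.append((p_o, unnormalized / p_o))
    return successors
</syntaxhighlight>

Together, these two functions give the one-step model of the continuous-state belief MDP on which the value function of the next section is defined.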
===Policy and value function===

The agent's policy <math>\pi</math> specifies an action <math>a=\pi(b)</math> for any belief <math>b</math>. Here it is assumed that the objective is to maximize the expected total discounted reward over an infinite horizon. When <math>R</math> defines a cost, the objective becomes the minimization of the expected cost.

The expected reward for policy <math>\pi</math> starting from belief <math>b_0</math> is defined as

:<math>
J^\pi(b_0) = \sum_{t=0}^\infty \gamma^t r(b_t, a_t) = \sum_{t=0}^\infty \gamma^t E\Bigl[ R(s_t,a_t) \mid b_0, \pi \Bigr],
</math>

where <math>\gamma<1</math> is the discount factor. The optimal policy <math>\pi^*</math> is obtained by maximizing the long-term reward:

:<math>
\pi^* = \underset{\pi}{\operatorname{argmax}}\; J^\pi(b_0),
</math>

where <math>b_0</math> is the initial belief.
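For a fixed policy, <math>J^\pi(b_0)</math> can be estimated by Monte Carlo simulation: sample a hidden state from <math>b_0</math>, repeatedly apply the policy to the current belief, sample the next state and observation from <math>T</math> and <math>\Omega</math>, update the belief, and accumulate discounted rewards. The sketch below is illustrative only (the array layout and names are assumptions) and truncates the infinite horizon at a fixed number of steps, which discounting justifies.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(T, Omega, R, policy, b0, gamma=0.95, horizon=200):
    """One Monte Carlo sample of the discounted return from belief b0.

    T[a, s, s']     = T(s' | s, a)
    Omega[a, s', o] = Omega(o | s', a)
    R[s, a]         = expected immediate reward
    policy          = function mapping a belief vector to an action index
    """
    n_states = R.shape[0]
    s = rng.choice(n_states, p=b0)           # sample the (hidden) initial state
    b = b0.copy()
    total, discount = 0.0, 1.0
    for _ in range(horizon):                 # truncate the infinite horizon
        a = policy(b)
        total += discount * R[s, a]
        s_next = rng.choice(n_states, p=T[a, s])             # hidden transition
        o = rng.choice(Omega.shape[2], p=Omega[a, s_next])    # observation
        b = Omega[a][:, o] * (T[a].T @ b)    # belief update tau(b, a, o)
        b /= b.sum()
        s = s_next
        discount *= gamma
    return total

# Toy model and a trivial policy that always picks action 0.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
Omega = np.array([[[0.8, 0.2], [0.3, 0.7]],
                  [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
b0 = np.array([0.5, 0.5])
returns = [rollout_return(T, Omega, R, lambda b: 0, b0) for _ in range(1000)]
print(np.mean(returns))                      # Monte Carlo estimate of J^pi(b0)
</syntaxhighlight>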
The optimal policy, denoted by <math>\pi^*</math>, yields the highest expected reward value for each belief state, compactly represented by the optimal value function <math>V^*</math>. This value function is the solution to the [[Bellman equation|Bellman optimality equation]]:

:<math>
V^*(b) = \max_{a\in A}\Bigl[ r(b,a) + \gamma\sum_{o\in O} P(o\mid b,a) V^*(\tau(b,a,o)) \Bigr].
</math>

For finite-horizon POMDPs, the optimal value function is piecewise-linear and convex.<ref>{{Cite thesis |degree=PhD |title=The optimal control of partially observable Markov processes |last=Sondik |first=E.J. |year=1971 |publisher=Stanford University }}</ref> It can be represented as a finite set of vectors. In the infinite-horizon formulation, <math>V^*</math> remains convex and can be approximated arbitrarily closely by a finite set of vectors. Value iteration applies a dynamic programming update to gradually improve the value until convergence to an <math>\epsilon</math>-optimal value function, and preserves its piecewise linearity and convexity.<ref>{{cite journal |doi=10.1287/opre.21.5.1071 |author=Smallwood, R.D., Sondik, E.J. |title=The optimal control of partially observable Markov decision processes over a finite horizon |journal=Operations Research |volume=21 |issue=5 |pages=1071–88 |year=1973 }}</ref> By improving the value, the policy is implicitly improved. Another dynamic programming technique, called policy iteration, explicitly represents and improves the policy instead.<ref>{{cite journal |doi=10.1287/opre.26.2.282 |author=Sondik, E.J. |title=The optimal control of partially observable Markov processes over the infinite horizon: discounted cost |journal=Operations Research |volume=26 |issue=2 |pages=282–304 |year=1978 }}</ref><ref>{{cite conference |booktitle=Proceedings of the Fourteenth International Conference on Uncertainty In Artificial Intelligence (UAI-98) |title=Solving POMDPs by searching in policy space |first=E. |last=Hansen |year=1998 }}</ref>
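For intuition, a piecewise-linear and convex value function can be stored as a finite set <math>\Gamma</math> of "alpha-vectors", one <math>|S|</math>-dimensional vector per linear piece, with <math>V(b) = \max_{\alpha\in\Gamma} \alpha\cdot b</math>. The Python sketch below only illustrates this representation (the names are assumptions, not a standard API); exact value iteration constructs such vector sets by repeated dynamic programming backups.

<syntaxhighlight lang="python">
import numpy as np

def value(b, Gamma):
    """Piecewise-linear convex value function V(b) = max_alpha alpha . b.

    Gamma : (k, |S|) array, each row an alpha-vector (one linear piece of V)
    b     : (|S|,) belief vector
    """
    return np.max(Gamma @ b)

def greedy_piece(b, Gamma):
    """Index of the maximizing alpha-vector at b (each vector is usually
    tagged with the action that generated it, which yields the greedy policy)."""
    return int(np.argmax(Gamma @ b))

# Two states, three linear pieces: V is the upper envelope of three lines
# over the belief simplex b = (p, 1 - p).
Gamma = np.array([[1.0, 0.0],
                  [0.6, 0.6],
                  [0.0, 1.2]])
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    b = np.array([p, 1.0 - p])
    print(p, value(b, Gamma), greedy_piece(b, Gamma))
</syntaxhighlight>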
==Approximate POMDP solutions==

In practice, POMDPs are often computationally [[Computational_complexity_theory#Intractability|intractable]] to solve exactly, so computer scientists have developed methods that approximate solutions for POMDPs.

Grid-based algorithms<ref>{{cite journal |doi=10.1287/opre.39.1.162 |author=Lovejoy, W. |title=Computationally feasible bounds for partially observed Markov decision processes |journal=Operations Research |volume=39 |pages=162–175 |year=1991 }}</ref> comprise one approximate solution technique. In this approach, the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action for other belief states that are encountered and that are not in the set of grid points. More recent work makes use of sampling techniques, generalization techniques and exploitation of problem structure, and has extended POMDP solving into large domains with millions of states.<ref name=hoey>{{cite conference |title=Assisting Persons with Dementia during Handwashing Using a Partially Observable Markov Decision Process |author=Jesse Hoey, Axel von Bertoldi, Pascal Poupart, Alex Mihailidis |booktitle=Proc. International Conference on Computer Vision Systems (ICVS) |date=2007 |doi=10.2390/biecoll-icvs2007-89}}</ref><ref name=hoeyCVIU>{{cite journal |doi=10.1016/j.cviu.2009.06.008 |title=Automated Handwashing Assistance For Persons With Dementia Using Video and a Partially Observable Markov Decision Process |author=Jesse Hoey, Pascal Poupart, Axel von Bertoldi, Tammy Craig, Craig Boutilier, Alex Mihailidis |journal=Computer Vision and Image Understanding (CVIU) |volume=114 |number=5 |year=2010}}</ref> For example, point-based methods sample random reachable belief points to constrain the planning to relevant areas in the belief space.<ref>{{cite conference |title=Point-based value iteration: An anytime algorithm for POMDPs |author=Pineau, J., Gordon, G., Thrun, S. |booktitle=International Joint Conference on Artificial Intelligence (IJCAI). Acapulco, Mexico |date=August 2003 |pages=1025–32 }}</ref> Dimensionality reduction using [[Principal component analysis|PCA]] has also been explored.<ref>{{cite book |author=Roy, Nicholas; Gordon, Geoffrey |chapter=Exponential Family PCA for Belief Compression in POMDPs |title=Advances in Neural Information Processing Systems |year=2003 }}</ref>
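To illustrate the point-based idea, reachable beliefs can be collected by forward simulation from the initial belief: repeatedly pick an action, sample an observation according to <math>P(o\mid b,a)</math>, and apply the belief update <math>\tau(b,a,o)</math>; value backups are then performed only at the collected points. The sketch below is illustrative only; the array layout and names are assumptions, and it is not the algorithm of any particular solver.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def sample_reachable_beliefs(T, Omega, b0, n_points=100, horizon=20):
    """Collect belief points reachable from b0 under random actions.

    T[a, s, s']     = T(s' | s, a)
    Omega[a, s', o] = Omega(o | s', a)
    """
    n_actions = T.shape[0]
    n_obs = Omega.shape[2]
    points = [b0]
    while len(points) < n_points:
        b = b0.copy()
        for _ in range(horizon):
            a = rng.integers(n_actions)
            # P(o | b, a) = sum_{s'} Omega(o|s',a) sum_s T(s'|s,a) b(s)
            predicted = T[a].T @ b
            p_obs = Omega[a].T @ predicted          # shape (|O|,)
            o = rng.choice(n_obs, p=p_obs)
            b = Omega[a][:, o] * predicted          # belief update tau(b, a, o)
            b /= b.sum()
            points.append(b)
            if len(points) >= n_points:
                break
    return np.array(points)

# Toy two-state model; the collected points constrain where value backups are done.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
Omega = np.array([[[0.8, 0.2], [0.3, 0.7]],
                  [[0.5, 0.5], [0.5, 0.5]]])
print(sample_reachable_beliefs(T, Omega, np.array([0.5, 0.5]), n_points=5))
</syntaxhighlight>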
==POMDP uses==

POMDPs model many kinds of real-world problems. Notable works include the use of a POMDP in assistive technology for persons with dementia<ref name=hoey/><ref name=hoeyCVIU/> and in the conservation of the critically endangered and difficult-to-detect Sumatran tigers.<ref name="chades">{{cite journal |author=Chadès, I., McDonald-Madden, E., McCarthy, M.A., Wintle, B., Linkie, M., Possingham, H.P. |title=When to stop managing or surveying cryptic threatened species |journal=Proc. Natl. Acad. Sci. U.S.A. |volume=105 |issue=37 |pages=13936–40 |date=16 September 2008 |url=http://www.pnas.org/content/105/37/13936.abstract |pmid=18779594 |pmc=2544557 |doi=10.1073/pnas.0805265105 |bibcode=2008PNAS..10513936C}}</ref>
==References==
{{reflist}}

==External links==
* [http://www.cassandra.org/pomdp/index.shtml Tony Cassandra's POMDP pages] with a tutorial, examples of problems modeled as POMDPs, and software for solving them.
* [http://www.cs.cmu.edu/~trey/zmdp/ zmdp], a POMDP solver by Trey Smith
* [http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/index.php?n=Main.HomePage APPL], a fast point-based POMDP solver
* [http://www.cs.uwaterloo.ca/~jhoey/research/spudd/ SPUDD], a factored structured (PO)MDP solver that uses algebraic decision diagrams (ADDs).
* [http://bitbucket.org/bami/pypomdp pyPOMDP], a (PO)MDP toolbox (simulator, solver, learner, file reader) for Python by Oliver Stollmann and Bastian Migge

[[Category:Dynamic programming]]
[[Category:Markov processes]]
[[Category:Stochastic control]]