If you have a very large decision making problem and want to find an approximately optimal policy, one of the best ways is often to use simulated trajectories of states, actions and utilities to learn the policy from experience.

In many natural resource management problems running simulations is very expensive because of the complex processes involved and because of spatial interactions across a landscape. This means we need an approximate planning algorithm for MDPs that minimizes the number of calls to the simulator. Our paper at AAAI presents an algorithm for doing that.

### Example Problem : Invasive Species Management

One example of a natural resource management problem with this kind of challenge is management of invasive river plants. For example, Tamarisk is an invasive plant species originating from the Middle East which has invaded over 3 million acres of land in the Western United States. It outcompetes local plants, consumes water and deposits salt into the soil. This pushes out native grass species, fundamentally changes the chemistry soil and alters an ecosystem that many other species rely on (read more : studies of Tamarisk by NASA; how the Tamarisk Collective is removing Tamarisk and restoring riparian ecosystems). Dropped leaves also create a dry layer of fuel that increases the risk of fire in the already fire-prone West.

Seeds from plants can spread up or down the river network leading to a huge number of reachable states. There is a choice of treatment actions available in each part of the river: we can eradicate invading plants and/or reintroduce native ones. Each treatment action has a cost, but the more expensive treatments are more effective at supplanting the invading plants.

### The Planning Problem

The planning problem is the following: Find the optimal policy for performing treatments spatially across a river network and over time in order to restore the native plant population and stay within a given budget level.

This problem can be represented as a Markov Decision Process (MDP) but it very quickly becomes intractable to solve optimally for larger problems. Ideally we want to find a policy with guarantees about how far it is from the optimal solution. PAC-MDP learning methods provide such guarantees by using long simulations to converge on a policy that is guaranteed to be within a given distance of the optimal policy with some probability (see **Sidebar: What is an MDP? What does PAC-MDP mean?**).

Most of the existing PAC-MDP methods look at a sequence of simulated actions and rewards and rely on revisiting states many times over and over to learn how to act optimally in those states. This does not fit the ecosystem management problem. In reality, we begin in a particular starting state *S*, in which the ecosystem is typically in some undesirable state far from its desired balance. The goal is to find a policy for moving to a world where *S* *does not* occur again.

### Our Approach

Our paper improves upon the best approaches for doing approximate planning in large problems in two ways :

- It obtains tighter confidence intervals on the quality of a policy by incorporating a bound on the probability of reaching additional (not-yet-visited) states. These tighter intervals mean that fewer simulations are needed.
- It introduces a more strategic method for choosing which state would be best to sample next by maintaining a discounted occupancy measure on all known states.

Our work is based on the idea of being able to restart planning from a fixed start state at any time. This idea was originally put forward by Fiechter in 1994 [3]. However, many important innovations have been made in PAC-MDP community which we apply to our problem.

A fundamental feature of many PAC-MDP algorithms is optimism under uncertainty. That is, if there are some states we’ve never encountered or evaluated, then we assume they are high value states. This encourages the algorithm to try to reach unknown states and find out their true value. If such a state does indeed have high value then we benefit directly; if it’s a bad state, then we learn quickly and become less likely to visit the state again. For optimism under uncertainty to work, we need to have an estimate of how likely we are to encounter any particular state over time.

One way to do this is with a confidence interval on the probability for reaching a state. The confidence interval can be computed based on how many times we’ve visited that state in previous simulations. The confidence intervals used in previous algorithms are quite loose. Their width typically depends on the square root of the number of states in the the state space. In spatial ecosystem management problems the state space is exponentially large, so this leads to very wide intervals. However, another property of real-world problems can help us. Typically, when we apply an action in a state, the set of possible resulting states is small. This means that only a small fraction of all the states will actually be reached over the planning horizon. So we can use the Good-Turing estimator of missing mass[5] to put an upper bound on the total probability of all the states we have never visited. We integrate this upper bound into the existing confidence bounds and get a tighter one that better represents when to stop exploring.

Since the key performance cost that we are trying to minimize is the cost of invoking the simulator, the key is to invoke the simulator on the most interesting state at each time step. The second advance in our paper presents a new way to define “most interesting state” by using an upper bound on the ** occupancy measure.** The occupancy measure of a state is essentially an estimate of how important a state will be given how likely we are to visit it and how far it is from the starting state. More precisely, it is the discounted probability that the optimal policy will occupy the state summed over all time, starting from a fixed starting state. We can compute and update this occupancy measure by dynamic programming during planning. Our key observation is that we can estimate the impact of choosing to explore some state

*K*with action

*A*over another in an efficient way. We compute just the impact locally on the confidence interval for the value

*K*

*weighted by its occupancy measure.*This lets the algorithm focus exploration on the most promising state-action pairs.

We ran our algorithm on four different MDPs including a form of the invasive species river network described above and compared the results to the optimal solutions and results from some other algorithms. We found that our approach requires many fewer samples than previous methods to achieve similar performance. The algorithm does this while maintaining standard PAC bounds on optimality.

If you want to know more you can read our paper here. To take a look at the invasive species river network problem yourself there is a more detailed problem definition and downloadable code from this year’s RL planning competition which included our problem as one of the test domains.

### Presentation

**Date: Thursday July 18, 2013
**

**Location and time: Session 31E: MDPs and Sequential Processes (10:35am)**.

### Citation:

Thomas Dietterich, Majid Alkaee Taleghan and Mark Crowley. PAC Optimal Planning for Invasive Species Management : Improved Exploration for Reinforcement Learning from Simulator-Defined MDPs. Proceedings of AAAI-13, Bellevue, Washington, USA.

*This is one of our series of posts on the latest research in Computational Sustainability being presented at conferences this summer. This time Mark Crowley, a Postdoc in Computer Science at Oregon State University tells us about their new paper at AAAI in Bellevue, Washington, USA this July which has a special track on Computational Sustainability research.*

## Sidebar: What’s an MDP? What does PAC-MDP mean?

A Markov Decision Process(MDP)[1,2] is a standard mathematical formulation for decision making problems containing *states* describing the world, *actions* that can be taken in each state, *rewards* that represent the utility obtained for taking an action in a given state and *dynamics* which define a conditional probability of transitioning from one state to another given a particular action. The solution to an MDP is a policy that tells for each state what action to take in that state in order to optimize the long-term cumulative reward. Given an MDP we can define the *value* of a policy as the expected reward obtained by following the policy over an infinite planning horizon discounted so that states farther in the future have less impact on the value.

There is a wide literature on solving MDPs exactly. The computational cost of these methods scales as the product of the number of actions times the square of the number of states. One community of approximate method that are well studied are the *Probably Approximately Correct MDP (PAC-MDP)* methods[3,4]. These methods take the idea of PAC estimators from statistics and apply them to estimating the value function of a policy. Essentially, an algorithm for learning a policy is said to be PAC-MDP if we can show that there is at least a probability (1-δ) chance that the value of the policy is within ε of the value of the optimal policy. Further, the algorithm must be efficient: It must halt and return its policy within an amount of time that grows only polynomially in the sizes of the input variables.

## References

- Bellman, R. 1957. Dynamic Programming. New Jersey: Princeton University Press.
- Puterman, M. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Proba- bility and Mathematical Statistics.Wiley.
- Fiechter, C.-N. 1994. Efficient Reinforcement Learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, 88–97. ACM Press.
- Strehl, A., and Littman, M. 2008. An Analysis of Model- Based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences 74(8):1309–1331.
- Good, I. J. 1953. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika 40(3):237–264.