## Find the Cheese

## Markov Decision Process

## Overview

The single agent, Markov Decision Process model, is a non-deterministic search problem. With the goal of finding an optimal policy or set of actions to maximize the agent's expected reward.

In the MDP program below a value iteration algorithm is used to implicitly update a policy. The agent or mouse exists in a 10 by 10 grid and is restricted to move up, down, left, or right. Each move or action has a transition probability of reaching a future state given the current state and action taken.

In other words, when the mouse moves there is a probability it will move in the unintended direction. For instance, the mouse may take the action to move up and instead moves to the left.

## The Action & State Space

In figure 1, the intent of the action is conveyed. Each square represents a future state. The action label for each square corresponds to an intended future state given an action. In our mouse example, there are two transition probabilities.

The first transition probability P1 is the probability of reaching the intended future state given the current state and action taken. The second transition probability P2 is the complement of P1 or 1 - P1 divided by the total number of actions minus one which in this case is 3 since n = 4.

## a

## 3

## 1

## a

## a

## 2

## a

## 0

figure 1: Intent of Action

## Finding The Optimal Policy

A policy gives the agent an action for each state, and an optimal policy maximizes the expected utility. In other words given the current position of the mouse the policy provides the best action for the mouse to take. There are two types of policies, value iteration and policy iteration. In our mouse example value iteration is used to implicitly update a policy. The policy is derived by calculating the expected utility for taking an action from the current state.

To alleviate the ambiguity of the equation it is imperative you know the summation obtains the expected utility of taking an action in the current state. The variable gamma or discounting factor depreciates the expected utility. We depreciate the expected utility because we prefer positive rewards now rather than later. Lastly, since we are calculating the utility for the current state we add the reward corresponding to the current state.

## S

## t+1

## k

State Variable

State Position

State Iteration

## S

## t+1

## 3

## S

## t+1

## 2

## S

## t+1

## 0

## S

## t+1

## 1

figure 2: The State Space