These are exam preparation notes, subpar in quality and certainly not of divine quality.

See the index with all articles in this series

In reinforcement learning an actor is in a world where she can perform different actions and perceive the environment. Sometimes there may be rewards. Reinforcement learning is about choosing a policy from which to derive actions that maximize the reward. Just like the real world there are a lot of rewards that require a lot of foresight.

The data is provided in triplets of

Markov Decision Process (MDP)

In a MDP the actor is in a discrete state and has a discrete set of possible actions. Also there is a transition model (just because you picked an action, does not mean you deterministically land in a specific state) as well as a reward function (usually based on an action in a specific state, can be nondeterministically, too).


The policy sets how the agent behaves. Often given as probability of the state:

Markov chain

A markov chain is a sequence of states and actions, where the next state is drawn from transition model (see above).

In a markov chain all relevant information is expressed in the current state and there is no need to look back into the past.

A markov chain can also be expressed as a bipartite tree.

Ergodicity is when a markov chain may return from any state to that state aperiodically.

Policy Evaluation

A value function is an estimate of the expected value of the policy in the initial state. You could determine it at any other state, but the initial one makes most sense for policy evaluation.

It is an average over the markov chain, usually with a discount factor for future rewards. Unfortunately this is inefficient.

Bellman equation

See slide 29 As the value function contains the expected future discounted rewards, the value function looks a little like this:

Model-based vs model-free approaches

In model based learning the model consists of the immediate reward and the probability for the transition to the next state .

These two models need to be established by the actor playing in the environment, visting each state infinitely often.

This leads to this value iteration function:

Temporal Difference(TD) learning

In reality these models of and are hard to come by. Often the estimation of the policy evaluation has to happen online (also to reduce computational storage complexity???).

For this reason temporal difference learning is used, which keeps no model and only uses the immediate reward.