Optimal Value Functions
Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. For finite MDPs, we can precisely define an optimal policy in the following way. Value functions define a partial ordering over policies. A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states.
In other words, π ≥ π′ if and only if v_π(s) ≥ v_π′(s) for all s ∈ S. There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by π∗. They share the same state-value function, called the optimal state-value function, denoted v∗, and defined as
$$v_*(s) = \max_{\pi} v_\pi(s),$$
for all s ∈ S.
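To make the ordering and the definition of v∗ concrete, here is a minimal sketch on a made-up two-state, two-action MDP. The dynamics table `P`, the discount `GAMMA = 0.9`, and the helper names `evaluate` and `better_or_equal` are all assumptions chosen for illustration. The sketch enumerates only deterministic policies, which is enough for a finite MDP because some deterministic policy is always optimal, and it evaluates each policy by repeated sweeps rather than by solving the linear system exactly.

```python
from itertools import product

GAMMA = 0.9          # assumed discount rate
STATES = [0, 1]
ACTIONS = [0, 1]

# Hypothetical dynamics: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, -10.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def evaluate(policy, sweeps=2000):
    """Iterative policy evaluation for a deterministic policy (tuple: state -> action)."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        # Synchronous sweep: every state is backed up from the previous estimate.
        v = [sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][policy[s]])
             for s in STATES]
    return v

def better_or_equal(v_a, v_b, tol=1e-9):
    """pi_a >= pi_b iff v_a(s) >= v_b(s) for every state s (up to a tolerance)."""
    return all(x >= y - tol for x, y in zip(v_a, v_b))

policies = list(product(ACTIONS, repeat=len(STATES)))
values = {pi: evaluate(pi) for pi in policies}

# v*(s) = max over policies of v_pi(s), taken state by state.
v_star = [max(values[pi][s] for pi in policies) for s in STATES]

# A policy is optimal if its value function matches v* in every state.
optimal = [pi for pi in policies if better_or_equal(values[pi], v_star)]

print("v* =", v_star)
print("optimal policies:", optimal)
```

In this toy problem the policies (0, 0) and (1, 1) are incomparable: the first is better in state 0 and the second is better in state 1, which is why the ordering is only partial. The policy (0, 1) is nevertheless at least as good as every other policy in both states, so the state-by-state maximum v∗ is attained by a single policy.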
Optimal policies also share the same optimal action-value function, denoted q∗, and defined as
$$q_*(s, a) = \max_{\pi} q_\pi(s, a),$$
for all s ∈ S and a ∈ A(s). For the state–action pair (s, a), this function gives the expected return for taking action a in state s and thereafter following an optimal policy. Thus, we can write q∗ in terms of v∗ as follows:
$$q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a \right].$$
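The relation between q∗ and v∗ can be checked numerically. The following sketch uses a made-up two-state, two-action MDP; the dynamics table `P`, the discount `GAMMA = 0.9`, and the helper names `backup` and `evaluate` are assumptions for illustration only. It computes q∗ twice: once from the definition as a maximum over (deterministic) policies, using the standard fact that q_π(s, a) is the expected one-step reward plus the discounted value of the next state under π, and once from the one-step lookahead on v∗.

```python
from itertools import product

GAMMA = 0.9          # assumed discount rate
STATES = [0, 1]
ACTIONS = [0, 1]

# Hypothetical dynamics: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, -10.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def backup(s, a, v):
    """E[R_{t+1} + gamma * v(S_{t+1}) | S_t = s, A_t = a] under the dynamics P."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])

def evaluate(policy, sweeps=2000):
    """Iterative policy evaluation for a deterministic policy (tuple: state -> action)."""
    v = [0.0] * len(STATES)
    for _ in range(sweeps):
        v = [backup(s, policy[s], v) for s in STATES]
    return v

policies = list(product(ACTIONS, repeat=len(STATES)))
v_pi = {pi: evaluate(pi) for pi in policies}

# Definition: q*(s, a) = max over policies of q_pi(s, a), where q_pi(s, a)
# takes action a once and follows pi afterwards.
q_star_def = {(s, a): max(backup(s, a, v_pi[pi]) for pi in policies)
              for s in STATES for a in ACTIONS}

# Identity: q*(s, a) = E[R_{t+1} + gamma * v*(S_{t+1}) | S_t = s, A_t = a].
v_star = [max(v_pi[pi][s] for pi in policies) for s in STATES]
q_star_lookahead = {(s, a): backup(s, a, v_star)
                    for s in STATES for a in ACTIONS}

for key in sorted(q_star_def):
    print(key, q_star_def[key], q_star_lookahead[key])
```

For this toy problem the two computations agree up to floating-point error, giving approximately q∗(0, 0) = 10, q∗(0, 1) = 8, q∗(1, 0) = 9, and q∗(1, 1) = 20. Agreement on one small example is only a sanity check, not a proof of the identity.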