Prioritized experience replay
- In general, experience replay can reduce the amount of experience required to learn, and replace it with more computation and more memory – which are often cheaper resources than the RL agent’s interactions with its environment.
- Transitions (between states) may be more or less:
- surprising (here measured by TD error; the agent has no model of the environment, but since it's Q-learning it does have a model of the expected return of each state-action pair),
- redundant, or
- task-relevant.
- Some sundry neuroscience links:
- Sequences associated with rewards appear to be replayed more frequently (Atherton et al., 2015; Ólafsdóttir et al., 2015; Foster & Wilson, 2006). Experiences with high-magnitude TD error also appear to be replayed more often (Singer & Frank, 2009, PMID 20064396; McNamara et al., 2014).
- They pose a useful toy example, 'Blind Cliffwalk', where the task amounts to learning a random sequence of bits. By choosing which experiences to replay well (via an oracle), learning can be exponentially faster than with uniform replay. (The environment is sketched in code after this list.)
- Prioritized replay introduces bias because it changes [the sampled state-action] distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to (even if the policy and state distribution are fixed). We can correct this bias by using importance-sampling (IS) weights.
- These weights are (roughly) the inverse of the sampling probabilities, w_i = (1 / (N * P(i)))^beta; the correction matters less at the start of training, when learning is highly non-stationary anyway, so the controlling exponent beta is annealed from an initial value up to 1 over the course of training (see the IS-weight sketch after this list).
- There are two ways of assigning the priorities (both sketched in code after this list):
- Proportional: priority is directly proportional to the magnitude of the TD error last seen for that transition (plus a small constant so no transition has zero probability).
- Rank-based: transitions are kept in a data structure ordered by TD error, and the priority is the inverse of a transition's rank in that ordering, which is less sensitive to outliers.
- Somewhat illuminating is how deep TD / Q-learning, even with prioritized replay, barely scratches the surface of Tetris or Montezuma's Revenge.
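
A minimal sketch of the Blind Cliffwalk environment as I read it: n states, two actions, where only the correct action in each state makes progress and the wrong one ends the episode with zero reward, so the single reward at the end is seen roughly once per 2^n random episodes. The class and method names (and the convention that the correct action is fixed at random per state) are my own:

```python
import random

class BlindCliffwalk:
    """Toy chain MDP: n states, 2 actions. The correct action advances one
    state toward the terminal reward of 1; the wrong action terminates the
    episode with reward 0."""

    def __init__(self, n_states=8, seed=0):
        self.n = n_states
        rng = random.Random(seed)
        # assumption: which action is "correct" is fixed per state at random
        self.correct = [rng.randint(0, 1) for _ in range(n_states)]
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Returns (next_state, reward, done)."""
        if action != self.correct[self.state]:
            return self.state, 0.0, True   # wrong bit: episode over, no reward
        if self.state == self.n - 1:
            return self.state, 1.0, True   # guessed every bit: reward at the end
        self.state += 1
        return self.state, 0.0, False

env = BlindCliffwalk(n_states=4)
s = env.reset()
print(env.step(env.correct[s]))   # -> (1, 0.0, False)
```

If I recall correctly, the paper fills the replay memory with all possible transitions up front and then learns from replay alone, which is what isolates the effect of the replay-selection strategy.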
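
The two prioritization variants, in a deliberately naive numpy form (a real implementation would use a sum-tree or a sorted structure for efficient sampling; the function and argument names are my own, and the constants are placeholders):

```python
import numpy as np

def sampling_probs(td_errors, alpha=0.6, variant="proportional", eps=1e-6):
    """Turn per-transition TD errors into replay sampling probabilities P(i).

    proportional: p_i = |delta_i| + eps
    rank-based:   p_i = 1 / rank(i), rank 1 being the largest |delta_i|
    Either way P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers
    uniform replay."""
    td = np.abs(np.asarray(td_errors, dtype=np.float64))
    if variant == "proportional":
        p = td + eps
    elif variant == "rank":
        ranks = np.empty_like(td)
        ranks[np.argsort(-td)] = np.arange(1, len(td) + 1)
        p = 1.0 / ranks
    else:
        raise ValueError(variant)
    p = p ** alpha
    return p / p.sum()

# toy usage: the transition with the big TD error dominates the minibatch
errors = [0.01, 0.02, 1.5, 0.05]
P = sampling_probs(errors, variant="proportional")
idx = np.random.choice(len(errors), size=2, replace=False, p=P)
print(P, idx)
```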
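
And the importance-sampling correction that goes with the sampling above; the linear annealing of the exponent toward 1 follows the paper, while the specific starting value and step counts here are placeholders:

```python
import numpy as np

def is_weights(probs, batch_idx, beta):
    """w_i = (N * P(i))^-beta, normalized by the largest weight so that
    updates are only ever scaled down. beta = 1 fully corrects for the
    non-uniform sampling; beta = 0 ignores it."""
    N = len(probs)
    w = (N * probs[batch_idx]) ** (-beta)
    return w / w.max()

def beta_schedule(step, total_steps, beta0=0.4):
    """Linearly anneal beta from beta0 to 1 over training: weak correction
    early on (when everything is non-stationary anyway), full correction
    by the end."""
    frac = min(step / total_steps, 1.0)
    return beta0 + frac * (1.0 - beta0)

# usage: scale each sampled transition's TD error by its weight before the
# gradient step, i.e. use w_i * delta_i in place of delta_i.
P = np.array([0.05, 0.10, 0.70, 0.15])   # e.g. output of sampling_probs above
idx = np.array([2, 0])                    # indices of the sampled minibatch
beta = beta_schedule(step=5_000, total_steps=50_000)
print(beta, is_weights(P, idx, beta))
```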