Curiosity-driven Exploration by Self-supervised Prediction
- Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, Trevor Darrell
- Key insight: “we only predict the changes in the environment that could possibly be due to actions of our agent or affect the agent, and ignore the rest”.
- Instead of making predictions in the sensory space (e.g. pixels), we transform the sensory input into a feature space where only the information relevant to the agent is represented.
- We learn this feature space with self-supervision: a neural network is trained on a proxy inverse dynamics task, predicting the agent’s action from the current and next sensory states.
- A forward dynamics model is then trained in this learned feature space to predict the feature representation of the next state from the current feature representation and the action.
- The forward model’s prediction error (the difference between the predicted and actual feature representation of the next state) serves as the intrinsic reward signal for the agent; a code sketch follows the list below.
- Quasi actor-critic / adversarial agent design, again.
- Used the asynchronous advantage actor-critic (A3C) policy gradient method (Mnih et al., 2016, Asynchronous Methods for Deep Reinforcement Learning); the second sketch below shows how the intrinsic reward could feed into such a policy update.
- Compared against variational information maximization (VIME) trained with TRPO (trust region policy optimization), which is “more sample efficient than A3C but takes more wall time”.
- References / concurrent work: several methods propose improving the data efficiency of RL algorithms using self-supervised, prediction-based auxiliary tasks (Jaderberg et al., 2017; Shelhamer et al., 2017).
- An interesting direction for future research is to use the learned exploration behavior / skill as a motor primitive / low level policy in a more complex, hierarchical system. For example, the skill of walking along corridors could be used as part of a navigation system.
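- A minimal PyTorch sketch of the intrinsic curiosity module (ICM) described above. The MLP encoder, layer sizes, and the choice to stop forward-model gradients at the encoder are illustrative assumptions; the paper itself uses a convolutional encoder on pixels and its own hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ICM(nn.Module):
    """Inverse + forward dynamics heads on top of a shared feature encoder."""

    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        # phi: maps raw observations into the learned feature space.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim),
        )
        # Inverse model: predicts the action from phi(s_t) and phi(s_{t+1}).
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),  # logits over discrete actions
        )
        # Forward model: predicts phi(s_{t+1}) from phi(s_t) and a_t.
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 64), nn.ReLU(),
            nn.Linear(64, feat_dim),
        )
        self.n_actions = n_actions

    def forward(self, obs, next_obs, action):
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)

        # Inverse dynamics loss: shapes the feature space so it keeps only
        # information relevant to predicting the agent's action.
        logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(logits, action)

        # Forward dynamics loss: predict the next features. Gradients are
        # stopped at the encoder here (a design choice assumed in this
        # sketch), so the feature space is learned via the inverse model.
        action_onehot = F.one_hot(action, self.n_actions).float()
        pred_phi_next = self.forward_model(
            torch.cat([phi.detach(), action_onehot], dim=-1))
        forward_loss = F.mse_loss(pred_phi_next, phi_next.detach())

        # Intrinsic reward: per-sample prediction error of the forward model.
        intrinsic_reward = 0.5 * (pred_phi_next - phi_next.detach()).pow(2).sum(dim=-1)
        return inverse_loss, forward_loss, intrinsic_reward.detach()
```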
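- A hypothetical usage sketch of how the curiosity reward could be mixed with the extrinsic reward before a policy-gradient update (A3C in the paper). It assumes the ICM class from the sketch above; the weights eta and beta and the dummy rollout tensors are illustrative, not taken from the paper.

```python
import torch

# Dummy rollout data standing in for real (s_t, a_t, s_{t+1}, r_t) transitions.
obs = torch.randn(16, 8)                     # batch of states s_t
next_obs = torch.randn(16, 8)                # batch of states s_{t+1}
action = torch.randint(0, 4, (16,))          # discrete actions a_t
extrinsic_reward = torch.zeros(16)           # environment reward (often sparse)

icm = ICM(obs_dim=8, n_actions=4)            # class from the sketch above
inverse_loss, forward_loss, r_intrinsic = icm(obs, next_obs, action)

# Total reward fed to the policy-gradient learner; eta and beta are
# illustrative values.
eta, beta = 0.01, 0.2
total_reward = extrinsic_reward + eta * r_intrinsic

# ICM training loss: trade-off between inverse and forward objectives.
icm_loss = (1.0 - beta) * inverse_loss + beta * forward_loss
icm_loss.backward()                          # gradients for encoder + both heads
```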