ref: -0 tags: variational free energy inference learning bayes curiosity insight Karl Friston date: 02-15-2019 02:09 gmt revision:1 [0] [head]

PMID-28777724 Active inference, curiosity and insight. Karl J. Friston, Marco Lin, Christopher D. Frith, Giovanni Pezzulo,

  • This has been my intuition for a while; you can learn abstract rules via active probing of the environment. This paper supports such intuitions with extensive scholarship.
  • “The basic theme of this article is that one can cast learning, inference, and decision making as processes that resolve uncertainty about the world.”
    • References Schmidhuber 1991
  • “A learner should choose a policy that also maximizes the learner’s predictive power. This makes the world both interesting and exploitable.” (Still and Precup 2012)
  • “Our approach rests on the free energy principle, which asserts that any sentient creature must minimize the entropy of its sensory exchanges with the world.” OK, that might be generalizing things too far.
  • Levels of uncertainty:
    • Perceptual inference, the causes of sensory outcomes under a particular policy
    • Uncertainty about policies or about future states of the world, outcomes, and the probabilistic contingencies that bind them.
  • For the last element (probabilistic contingencies between the world and outcomes), they employ Bayesian model selection / Bayesian model reduction
    • This can be applied not only to the data, but also to the initial model alone.
    • “We use simulations of abstract rule learning to show that context-sensitive contingencies, which are manifest in a high-dimensional space of latent or hidden states, can be learned with straightforward variational principles (i.e. minimization of free energy).”
  • Assume that initial states and state transitions are known.
  • Perception or inference about hidden states (i.e. state estimation) corresponds to inverting a generative model given a sequence of outcomes, while learning involves updating the parameters of the model.
  • The actual task is quite simple: central fixation leads to a color cue. The cue + peripheral color determine which way to saccade.
  • Gestalt: Good intuitions, but I’m left with the impression that the authors overexplain and / or make the description more complicated than it needs to be.
    • The actual number of parameters to be inferred is rather small -- 3 states in 4 (?) dimensions, and these parameters are not hard to learn by minimizing the variational free energy:
    • $F = D[Q(x) \,\|\, P(x)] - E_q[\ln P(o_t \mid x)]$ where $D$ is the Kullback-Leibler divergence.
      • Mean field approximation: $Q(x)$ is fully factored (not here). many more notes
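
To make the free energy above concrete, here is a minimal sketch for a single discrete hidden state and one observed outcome. The toy prior and likelihood values and the function names are my own assumptions, not taken from the paper. It illustrates a standard identity: F is smallest when Q(x) equals the exact posterior, and there it equals the negative log evidence, -ln P(o).

```python
import math

def free_energy(q, prior, likelihood_o):
    """F = KL[Q(x) || P(x)] - E_Q[ln P(o | x)] for a discrete hidden state x.

    q, prior: probabilities over hidden states (each list sums to 1).
    likelihood_o: P(o | x) for the one observed outcome o, per state x.
    """
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0)
    expected_loglik = sum(qi * math.log(li) for qi, li in zip(q, likelihood_o) if qi > 0)
    return kl - expected_loglik

def exact_posterior(prior, likelihood_o):
    """Bayes rule: returns (P(x | o), P(o))."""
    joint = [p * l for p, l in zip(prior, likelihood_o)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

# Hypothetical toy numbers: 3 hidden states, one observed outcome.
prior = [0.5, 0.3, 0.2]
likelihood_o = [0.9, 0.2, 0.1]

posterior, evidence = exact_posterior(prior, likelihood_o)
f_at_posterior = free_energy(posterior, prior, likelihood_o)  # equals -ln P(o)
f_at_prior = free_energy(prior, prior, likelihood_o)          # a worse guess for Q

print(f_at_posterior, -math.log(evidence), f_at_prior)
```

Any Q other than the posterior gives a larger F, which is why minimizing F over Q performs approximate inference.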

ref: -0 tags: curiosity exploration forward inverse models trevor darrell date: 02-01-2019 03:42 gmt revision:1 [0] [head]

Curiosity-driven Exploration by Self-supervised Prediction

  • Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, Trevor Darrell
  • Key insight: “we only predict the changes in the environment that could possibly be due to actions of our agent or affect the agent, and ignore the rest”.
    • Instead of making predictions in the sensory space (e.g. pixels), we transform the sensory input into a feature space where only the information relevant to the agent is represented.
    • We learn this feature space using self-supervision -- training a neural network via a proxy inverse dynamics task -- predicting the agent’s action from the past and future sensory states.
  • We then use this feature space to train a forward dynamics model that predicts the feature representation of the next state from the present feature representation and the action.
    • The difference between the predicted and actual representation serves as a reward signal for the agent.
  • Quasi actor-critic / adversarial agent design, again.
  • Used the asynchronous advantage actor-critic (A3C) policy gradient method (Mnih et al 2016 Asynchronous Methods for Deep Reinforcement Learning).
  • Compared with VIME (Variational Information Maximizing Exploration) trained with TRPO (Trust Region Policy Optimization), which is “more sample efficient than A3C but takes more wall time”.
  • References / concurrent work: Several methods propose improving data efficiency of RL algorithms using self-supervised prediction based auxiliary tasks (Jaderberg et al., 2017; Shelhamer et al., 2017).
  • An interesting direction for future research is to use the learned exploration behavior / skill as a motor primitive / low level policy in a more complex, hierarchical system. For example, the skill of walking along corridors could be used as part of a navigation system.
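
A minimal sketch of the curiosity reward described above: the forward model predicts the next state's features, and the squared prediction error becomes the intrinsic reward. The linear forward model, the scaling factor `eta`, and all numbers here are hypothetical stand-ins of mine. In the paper both the feature encoder and the forward model are learned neural networks.

```python
def intrinsic_reward(predicted_features, actual_features, eta=1.0):
    """Curiosity reward: eta/2 times the squared L2 distance between the
    predicted and the actual next-state feature vectors."""
    sq_err = sum((p - a) ** 2 for p, a in zip(predicted_features, actual_features))
    return 0.5 * eta * sq_err

def linear_forward_model(weights, features, action_one_hot):
    """Hypothetical forward model: one linear layer over [features, action]."""
    x = list(features) + list(action_one_hot)
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Toy example: 2-d feature space, 2 discrete actions (one-hot encoded).
phi_t = [1.0, 0.0]              # features of the current state
action = [0.0, 1.0]             # the agent took action 1
weights = [[1.0, 0.0, 0.0, 0.5],
           [0.0, 1.0, 0.0, 0.0]]

phi_next_predicted = linear_forward_model(weights, phi_t, action)
phi_next_actual = [1.5, 1.0]    # features actually observed after acting

reward = intrinsic_reward(phi_next_predicted, phi_next_actual, eta=0.2)
print(phi_next_predicted, reward)
```

A perfectly predicted transition yields zero reward, so the agent is pushed toward transitions its forward model cannot yet predict.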