One-shot learning by inverting a compositional causal process
- Brenden Lake, Russ Salakhutdinov, Josh Tenenbaum
- This is the paper that preceded the 2015 Science publication "Human-level concept learning through probabilistic program induction".
- Because it's a NIPS paper, and not a Science paper, this one is a bit more accessible: the logic behind the details and developments is apparent.
- General idea: build up a fully probabilistic model of multi-language (Omniglot corpus) characters / tokens. This model includes things like character type / alphabet, number of strokes, curvature of strokes (parameterized via Bézier splines), where strokes attach to others (spatial relations), stroke scale, and character scale. The model (won't repeat the formal definition) is factorized to be both compositional and causal, though all the details of the conditional probs are left to the supplemental material.
- They fit the complete model to the Omniglot data using gradient descent + image-space noising, i.e. tweak the free parameters of the model to generate images that look like the human-created characters. (This too is in the supplement.)
- Because the model is high-dimensional and hard to invert, they generate a perceptual model by winnowing down the image into a skeleton, then breaking this into a variable number of strokes.
- The probabilistic model then assigns a log-likelihood to each of the parses.
- They then use the model with Metropolis-Hastings MCMC to sample a region in parameter space around each parse, and they additionally sample the character type to get a greater weighted diversity of types.
- Surprisingly, they don't estimate the image likelihood, which is expensive; here they just re-do the parsing based on aggregate info embedded in the statistical model. Clever.
- ψ is the character type (a, b, c, ...), with ψ = {κ, S, R}: κ (kappa) is the number of strokes, S is a set of parameterized strokes, and R are the relations between strokes.
- θ(m) are the per-token stroke parameters.
- I(m) is the image, obvi.
- Classification task: one image of a new character (c) vs 20 new characters from the same alphabet (test, (t)). Among the 20 there is one character of the same type; the task is to find it.
- With 'hierarchical Bayesian program learning' (HBPL), they not only anneal the type to the parameters (with MCMC, above) for the test image, but they also fit the parameters to the image using gradient descent.
- They subsequently parse the test image onto the class image (c).
- Hence the best classification is the one where the two directions are in the best agreement, with one term of the score approximated by the parse weights.
- Again, this is clever as it allows significant information leakage between (c) and (t) ...
- The other models (Affine, Deep Boltzmann Machines, Hierarchical Deep Model) have nothing like this -- they are feed-forward.
- No wonder HBPL performs better. It's a better model of the data, one that has a bidirectional fitting routine.
- As I read the paper, I had a few vague 'hedons':
- Model building is essential. But unidirectional models are insufficient; if the models include the mechanism for their own inversion many fitting and inference problems are solved. (Such is my intuition)
- As a corollary of this, forward and backward tags (links) can be used to neatly solve the binding problem. This should be easy in a computer w/ pointers, though in the brain I'm not sure how it might work (?!) without some sort of combinatorial explosion?
- The fitting process has to be multi-pass or at least re-entrant. Both this paper and the Vicarious CAPTCHA paper feature statistical message passing to infer or estimate hidden explanatory variables. Seems correct.
- The model here includes relations that are conditional on stroke parameters that occurred / were parsed beforehand; this is very appealing in that the model/generator/AI needs to be flexibly re-entrant to support hierarchical planning ...
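
To make the "sample a region in parameter space around each parse" step above concrete, here is a minimal random-walk Metropolis-Hastings sketch. It is not the paper's sampler: `toy_log_score` is a hypothetical stand-in (a Gaussian bump) for the model's log-score, the two-element `start` vector stands in for the parameters of one parse, and the proposal width is an arbitrary choice.

```python
import math
import random


def metropolis_hastings(log_score, start, n_steps=5000, step_size=0.5, seed=0):
    """Random-walk Metropolis-Hastings over a list of real-valued parameters.

    log_score: function returning an unnormalized log-probability.
    start: initial parameter vector, e.g. the parameters of one parse.
    """
    rng = random.Random(seed)
    current = list(start)
    current_lp = log_score(current)
    samples = []
    for _ in range(n_steps):
        # Symmetric Gaussian proposal centred on the current point.
        proposal = [x + rng.gauss(0.0, step_size) for x in current]
        proposal_lp = log_score(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        log_u = math.log(1.0 - rng.random())
        if log_u < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(list(current))
    return samples


def toy_log_score(params):
    # Hypothetical stand-in for the model's log-score:
    # an isotropic Gaussian bump centred at (1.0, -2.0).
    target = (1.0, -2.0)
    return -0.5 * sum((p - t) ** 2 for p, t in zip(params, target))


samples = metropolis_hastings(toy_log_score, start=[0.0, 0.0])
kept = samples[1000:]  # discard burn-in
mean = [sum(s[i] for s in kept) / len(kept) for i in range(2)]
```

Starting the chain at a parse and keeping the step size small is what makes the samples stay in a local region around that parse, which is the behaviour the notes describe.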
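
The two-way classification idea from the notes (fit the class image to the test image and the test image to the class image, then pick the pairing where both agree best) can be sketched as follows. Everything here is an assumption for illustration: the function name, the made-up log-scores, and in particular the choice to simply add the two directional log-scores, which omits any normalizing term and is not claimed to be the paper's exact rule.

```python
def classify_one_shot(forward_log, backward_log):
    """Pick the candidate whose combined two-way score is highest.

    forward_log[i]: hypothetical log-score of the query image under
        candidate i's parses.
    backward_log[i]: hypothetical log-score of candidate i's image under
        the query image's parses.
    """
    if len(forward_log) != len(backward_log):
        raise ValueError("need one score per candidate in each direction")
    # Assumption of this sketch: combine directions by adding log-scores.
    combined = [f + b for f, b in zip(forward_log, backward_log)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined


# Made-up scores for three candidates (the task in the paper uses 20).
forward = [-120.0, -95.0, -101.0]
backward = [-110.0, -97.0, -90.0]
best, combined = classify_one_shot(forward, backward)
```

With these made-up numbers the forward direction alone would prefer candidate 1, while the combined score prefers candidate 2, which is the kind of disagreement a bidirectional fit is meant to resolve.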