One-shot learning by inverting a compositional causal process
 Brenden Lake, Russ Salakhutdinov, Josh Tenenbaum
 This is the paper that preceded the 2015 Science publication "Human-level concept learning through probabilistic program induction".
 Because it's a NIPS paper, and not a Science paper, this one is a bit more accessible: the logic behind the details and developments is apparent.
 General idea: build up a fully probabilistic model of multi-language (Omniglot corpus) characters / tokens. This model includes things like character type / alphabet, number of strokes, curvature of strokes (parameterized via Bezier splines), where strokes attach to others (spatial relations), stroke scale, and character scale. The model (won't repeat the formal definition) is factorized to be both compositional and causal, though all the details of the conditional probabilities are likely left to the supplemental material.
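 To make "compositional and causal" concrete, here is a toy sketch of sampling a character in that order: stroke count first, then each stroke, with later strokes optionally attached to earlier ones. Everything in it (the names `Stroke`, `sample_stroke`, `sample_character`, the uniform / Gaussian priors, the 0.5 attachment probability) is an assumption for illustration, not the paper's learned distributions.

```python
import random
from dataclasses import dataclass


@dataclass
class Stroke:
    control_points: list  # four (x, y) control points of one cubic Bezier


def sample_stroke(rng, start=None):
    """Sample one stroke; if `start` is given, the stroke begins there."""
    p0 = start if start is not None else (rng.uniform(0, 1), rng.uniform(0, 1))
    # Remaining control points are Gaussian offsets from the start point (toy prior).
    rest = [(p0[0] + rng.gauss(0, 0.2), p0[1] + rng.gauss(0, 0.2)) for _ in range(3)]
    return Stroke([p0] + rest)


def bezier_point(stroke, t):
    """Evaluate the cubic Bezier curve of `stroke` at t in [0, 1]."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = stroke.control_points
    b = [(1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3]
    return (b[0] * x0 + b[1] * x1 + b[2] * x2 + b[3] * x3,
            b[0] * y0 + b[1] * y1 + b[2] * y2 + b[3] * y3)


def sample_character(rng, max_strokes=4):
    """Causal order: stroke count, then strokes, each conditioned on earlier ones."""
    n_strokes = rng.randint(1, max_strokes)
    strokes = []
    for _ in range(n_strokes):
        if strokes and rng.random() < 0.5:
            # Relation "attach": start at a random point along an earlier stroke.
            parent = rng.choice(strokes)
            start = bezier_point(parent, rng.random())
        else:
            # Relation "independent": start anywhere in the unit square.
            start = None
        strokes.append(sample_stroke(rng, start))
    return strokes


character = sample_character(random.Random(0))
```

 The point of the sketch is only the ordering: relations are sampled conditional on strokes that already exist, which is what makes the generator causal.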
 They fit the complete model to the Omniglot data using gradient descent + image-space noising, i.e. they tweak the free parameters of the model to generate images that look like the human-created characters. (This too is in the supplement.)
 Because the model is high-dimensional and hard to invert, they generate a perceptual model by winnowing down the image into a skeleton, then breaking this into a variable number of strokes.
 The probabilistic model then assigns a log-likelihood to each of the parses.
 They then use the model with Metropolis-Hastings MCMC to sample a region in parameter space around each parse, but they sample $\psi$ (the character type) to get a greater weighted diversity of types.
 Surprisingly, they don't estimate the image likelihood, which is expensive; here they just redo the parsing based on aggregate info embedded in the statistical model. Clever.
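 For reference, a generic random-walk Metropolis-Hastings loop looks like the sketch below. It is not the paper's sampler: `log_score` is a hypothetical stand-in (a Gaussian bump) for the model's log-probability around one parse, and the step size and chain length are arbitrary assumptions.

```python
import math
import random


def log_score(theta):
    """Hypothetical unnormalized log-probability, peaked at (1.0, -0.5)."""
    return -0.5 * ((theta[0] - 1.0) ** 2 + (theta[1] + 0.5) ** 2) / 0.1


def metropolis_hastings(log_p, init, n_steps=2000, step=0.2, seed=0):
    """Random-walk MH with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = list(init)
    current_lp = log_p(current)
    samples = []
    for _ in range(n_steps):
        proposal = [x + rng.gauss(0.0, step) for x in current]
        proposal_lp = log_p(proposal)
        # Symmetric proposal, so accept with probability min(1, p(new) / p(old)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(list(current))
    return samples


samples = metropolis_hastings(log_score, init=[0.0, 0.0])
kept = samples[500:]  # discard burn-in
mean = [sum(s[i] for s in kept) / len(kept) for i in range(2)]
```

 Starting the chain at an initial parse and keeping the post-burn-in samples gives a local cloud of parameter settings around that parse, which is the role MCMC plays in the notes above.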
 $\psi$ is the character type (a, b, c, ...), $\psi = \{ \kappa, S, R \}$, where $\kappa$ is the number of strokes, $S$ is a set of parameterized strokes, and $R$ are the relations between strokes.
 $\theta$ are the per-token stroke parameters.
 $I$ is the image, obvi.
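 With these symbols, the top level of the factorization can be written compactly. This is my recollection of the paper's joint over one type and its $M$ tokens (the superscript $(m)$ indexing tokens is my notation here; check the paper for the exact form and for the conditional terms inside $P(\psi)$ and $P(\theta \mid \psi)$):

$$
P(\psi, \theta^{(1)}, \dots, \theta^{(M)}, I^{(1)}, \dots, I^{(M)}) = P(\psi) \prod_{m=1}^{M} P(I^{(m)} \mid \theta^{(m)}) \, P(\theta^{(m)} \mid \psi)
$$

 Read causally: a type $\psi$ is drawn once, each token $\theta^{(m)}$ is a noisy realization of that type, and each image $I^{(m)}$ is rendered from its token.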
 Classification task: one test image (t) of a new character vs. 20 new characters from the same alphabet, each given as a single class image (c). Among the 20 there is one character of the same type as the test image; the task is to find it.
 With 'hierarchical Bayesian program learning' (HBPL), they not only anneal the type to the parameters (with MCMC, above) for the test image, but they also fit the parameters to the image using gradient descent.
 They subsequently parse the test image onto the class image (c).
 Hence the best classification is the one where both are in the best agreement: $\underset{c}{\arg\max} \frac{P(c \mid t)}{P(c)} P(t \mid c)$, where $P(c)$ is approximated as the parse weights.
 Again, this is clever as it allows significant information leakage between (c) and (t) ...
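 In log space the two-way rule is just a sum of three terms per candidate class. A minimal sketch, assuming the three log-quantities have already been computed by the model; the function name `classify` and the numbers are made up for illustration (three classes instead of 20):

```python
def classify(log_p_c_given_t, log_p_c, log_p_t_given_c):
    """Return (best index, scores) for log P(c|t) - log P(c) + log P(t|c)."""
    scores = [
        back - prior + fwd
        for back, prior, fwd in zip(log_p_c_given_t, log_p_c, log_p_t_given_c)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical log-scores for three candidate classes.
best, scores = classify(
    log_p_c_given_t=[-120.0, -95.0, -130.0],
    log_p_c=[-100.0, -101.0, -99.0],
    log_p_t_given_c=[-110.0, -90.0, -125.0],
)
```

 A class wins only when it explains the test image well and the test image's parse explains the class image well, relative to how probable the class image is on its own.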
 The other models (Affine, Deep Boltzmann Machines, Hierarchical Deep Model) have nothing like this; they are feedforward.
 No wonder HBPL performs better. It's a better model of the data, and it has a bidirectional fitting routine.
 As I read the paper, I had a few vague 'hedons':
 Model building is essential. But unidirectional models are insufficient; if the models include the mechanism for their own inversion many fitting and inference problems are solved. (Such is my intuition)
 As a corollary of this, having both forward and backward tags (links) can be used to neatly solve the binding problem. This should be easy in a computer w/ pointers, though in the brain I'm not sure how it might work (?!) without some sort of combinatorial explosion?
 The fitting process has to be multipass or at least reentrant. Both this paper and the Vicarious CAPTCHA paper feature statistical message passing to infer or estimate hidden explanatory variables. Seems correct.
 The model here includes relations that are conditional on stroke parameters that occurred / were parsed beforehand; this is very appealing in that the model/generator/AI needs to be flexibly reentrant to support hierarchical planning ...
