use https for features.
text: sort by
tags: modified
type: chronology
hide / / print
ref: -0 tags: tennenbaum compositional learning character recognition one-shot learning date: 09-29-2020 03:44 gmt revision:1 [0] [head]

One-shot learning by inverting a compositional causal process

  • Brenden Lake, Russ Salakhutdinov, Josh Tennenbaum
  • This is the paper that preceded the 2015 Science publication "Human level concept learning through probabalistic program induction"
  • Because it's a NIPS paper, and not a science paper, this one is a bit more accessible: the logic to the details and developments is apparent.
  • General idea: build up a fully probabilistic model of multi-language (omniglot corpus) characters / tokens. This model includes things like character type / alphabet, number of strokes, curvature of strokes (parameterized via bezier splines), where strokes attach to others
(spatial relations), stroke scale, and character scale. The model (won't repeat the formal definition) is factorized to be both compositional and causal, though all the details of the conditional probs are likely left to the supplemental material.
  • They fit the complete model to the Omniglot data using gradient descent + image-space noising, e.g tweak the free parameters of the model to generate images that look like the human created characters. (This too is in the supplement).
  • Because the model is high-dimensional and hard to invert, they generate a perceptual model by winnowing down the image into a skeleton, then breaking this into a variable number of strokes.
    • The probabilistic model then assigns a log-likelihood to each of the parses.
    • They then use the model with Metropolis-Hastings MCMC to sample a region in parameter space around each parse -- but they sample ψ\psi (the character type) to get a greater weighted diversity of types.
      • Surprisingly, they don't estimate the image likelihood - which is expensive - they here just re-do the parsing based on aggregate info embedded in the statistical model. Clever.
  • ψ\psi is the character type (a, b, c..), ψ=κ,S,R\psi = { \kappa, S, R } where kappa are the number of strokes, S is a set of parameterized strokes, R are the relations between strokes.
  • θ\theta are the per-token stroke parameters.
  • II is the image, obvi.
  • Classification task: one image of a new character (c) vs 20 characters new characters from the same alphabet (test, (t)). In the 20 there is one character of the same type -- task is to find it.
  • With 'hierarchical bayesian program learning', they not only anneal the type to the parameters (with MCMC, above) for the test image, but they also fit the parameters using gradient descent to the image.
    • Subsequently parses the test image onto the class image (c)
    • Hence the best classification is the one where both are in the best agreement: argmaxcP(c|t)P(c)P(t|c)\underset{c}{argmax} \frac{P(c|t)}{P(c)} P(t|c) where P(c)P(c) is approximated as the parse weights.
      • Again, this is clever as it allows significant information leakage between (c) and (t) ...
      • The other models (Affine, Deep Boltzman Machines, Hierarchical Deep Model) have nothing like this -- they are feed-forward.
  • No wonder HBPL performs better. It's a better model of the data, that has a bidirectional fitting routine.

  • As i read the paper, had a few vague 'hedons':
    • Model building is essential. But unidirectional models are insufficient; if the models include the mechanism for their own inversion many fitting and inference problems are solved. (Such is my intuition)
      • As a corrolary of this, having both forward and backward tags (links) can be used to neatly solve the binding problem. This should be easy in a computer w/ pointers, though in the brain I'm not sure how it might work (?!) without some sort of combinatorial explosion?
    • The fitting process has to be multi-pass or at least re-entrant. Both this paper and the Vicarious CAPTCHA paper feature statistical message passing to infer or estimate hidden explanatory variables. Seems correct.
    • The model here includes relations that are conditional on stroke parameters that occurred / were parsed beforehand; this is very appealing in that the model/generator/AI needs to be flexibly re-entrant to support hierarchical planning ...

hide / / print
ref: -2012 tags: DiCarlo Visual object recognition inferior temporal cortex dorsal ventral stream V1 date: 03-13-2019 22:24 gmt revision:1 [0] [head]

PMID-22325196 How Does the Brain Solve Visual Object Recognition

  • James DiCarlo, Davide Zoccolan, Nicole C Rust.
  • Infero-temporal cortex is organized into behaviorally relevant categories, not necessarily retinotopically, as demonstrated with TMS studies in humans, and lesion studies in other primates.
    • Synaptic transmission takes 1-2ms; dendritic propagation ?, axonal propagation ~1ms (e.g. pyramidal antidromic activation latency 1.2-1.3ms), so each layer can use several synapses for computation.
  • Results from the ventral stream computation can be well described by a firing rate code binned at ~ 50ms. Such a code can reliably describe and predict behavior
    • Though: this does not rule out codes with finer temporal resolution.
    • Though anyway: it may be inferential issue, as behavior operates at this timescale.
  • IT neurons' responses are sparse, but still contain information about position and size.
    • They are not narrowly tuned detectors, not grandmother cells; they are selective and complex but not narrow.
    • Indeed, IT neurons with the highest shape selectivities are the least tolerate to changes in position, scale, contrast, and visual clutter. (Zoccolan et al 2007)
    • Position information avoids the need to re-bind attributes with perceptual categories -- no need for syncrhony binding.
  • Decoded IT population activity of ~100 neurons exceeds artificial vision systems (Pinto et al 2010).
  • As in {1448}, there is a ~ 30x expansion of the number of neurons (axons?) in V1 vs the optic tract; serves to allow controlled sparsity.
  • Dispute in the field over primarily hierarchical & feed-forward vs. highly structured feedback being essential for performance (and learning?) of the system.
    • One could hypothesize that feedback signals help lower levels perform inference with noisy inputs; or feedback from higher layers, which is prevalent and manifest (and must be important; all that membrane is not wasted..)
    • DiCarlo questions if the re-entrant intra-area and inter-area communication is necessary for building object representations.
      • This could be tested with optogenetic approaches; since the publication, it may have been..
      • Feedback-type active perception may be evinced in binocular rivalry, or in visual illusions;
      • Yet 150ms immediate object recognition probably does not require it.
  • Authors propose thinking about neurons/local circuits as having 'job descriptions', an metaphor that couples neuroscience to human organization: who is providing feedback to the workers? Who is providing feeback as to job function? (Hinton 1995).
  • Propose local subspace untangling; when this is tacked and tiled, this is sufficient for object perception.
    • Indeed, modern deep convolutional networks behave this way; yet they still can't match human performance (perhaps not sparse enough, not enough representational capability)
    • Cite Hinton & Salakhutdinov 2006.
  • The AND-OR or conv-pooling architecture was proposed by Hubbel and Weisel back in 1962! In their paper's formulatin, they call it a Normalized non-linear model, NLN.
  1. Nonlinearities tend to flatten object manifolds; even with random weights, NLN models tend to produce easier to decode object identities, based on strength of normalization. See also {714}.
  2. NLNs are tuned / become tuned to the statistics of real images. But they do not get into discrimination / perception thereof..
  3. NLNs learn temporally: inputs that occur temporally adjacent lead to similar responses.
    1. But: scaades? Humans saccade 100 million times per year!
      1. This could be seen as a continuity prior: the world is unlikely to change between saccades, so one can infer the identity and positions of objects on the retina, which say can be used to tune different retinotopic IT neurons..
    2. See Li & DiCarlo -- manipulation of image statistics changing visual responses.
  • Regarding (3) above, perhaps attention is a modifier / learning gate?

hide / / print
ref: -0 tags: hinton convolutional deep networks image recognition 2012 date: 01-11-2014 20:14 gmt revision:0 [head]

ImageNet Classification with Deep Convolutional Networks