PMID-22325196 How Does the Brain Solve Visual Object Recognition?
- James DiCarlo, Davide Zoccolan, Nicole C Rust.
- Infero-temporal cortex is organized into behaviorally relevant categories, not necessarily retinotopically, as demonstrated with TMS studies in humans, and lesion studies in other primates.
-
- Synaptic transmission takes 1-2ms; dendritic propagation ?, axonal propagation ~1ms (e.g. pyramidal antidromic activation latency 1.2-1.3ms), so each layer can use several synapses for computation.
- Results from the ventral stream computation can be well described by a firing rate code binned at ~50ms. Such a code can reliably describe and predict behavior.
- Though: this does not rule out codes with finer temporal resolution.
- Though anyway: it may be an inferential issue, as behavior operates at this timescale.
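The ~50ms binned rate code noted above can be made concrete with a minimal sketch. The function name `binned_rates`, its parameters, and the spike times are hypothetical illustrations, not data or code from the paper; the only thing taken from the notes is the 50ms bin width.

```python
def binned_rates(spike_times_ms, t_start_ms, t_stop_ms, bin_ms=50.0):
    """Return firing rates (spikes/s) in consecutive bins of width bin_ms."""
    n_bins = int((t_stop_ms - t_start_ms) // bin_ms)
    counts = [0] * n_bins
    for t in spike_times_ms:
        idx = int((t - t_start_ms) // bin_ms)
        # Ignore spikes that fall outside the analysis window.
        if 0 <= idx < n_bins:
            counts[idx] += 1
    # Convert spike counts per bin to spikes per second.
    return [c * 1000.0 / bin_ms for c in counts]


# Hypothetical spike times (ms after stimulus onset), one neuron, one trial.
spikes = [12.0, 71.5, 88.0, 103.2, 110.9, 131.0, 149.9, 162.4, 240.0]
rates = binned_rates(spikes, 0.0, 250.0)
print(rates)  # [20.0, 40.0, 80.0, 20.0, 20.0]
```

Any spike timing structure inside a bin is discarded by construction, which is why a good fit of such a code to behavior does not by itself rule out finer temporal codes.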
-
- IT neurons' responses are sparse, but still contain information about position and size.
- They are not narrowly tuned detectors, not grandmother cells; they are selective and complex but not narrow.
- Indeed, IT neurons with the highest shape selectivities are the least tolerant to changes in position, scale, contrast, and visual clutter. (Zoccolan et al 2007)
- Position information avoids the need to re-bind attributes with perceptual categories -- no need for synchrony binding.
- Decoded IT population activity of ~100 neurons exceeds artificial vision systems (Pinto et al 2010).
- As in {1448}, there is a ~ 30x expansion of the number of neurons (axons?) in V1 vs the optic tract; serves to allow controlled sparsity.
- Dispute in the field over primarily hierarchical & feed-forward vs. highly structured feedback being essential for performance (and learning?) of the system.
- One could hypothesize that feedback from higher layers, which is prevalent and manifest, helps lower levels perform inference with noisy inputs (it must be important; all that membrane is not wasted).
- DiCarlo questions if the re-entrant intra-area and inter-area communication is necessary for building object representations.
- This could be tested with optogenetic approaches; since the publication, it may have been..
- Feedback-type active perception may be evinced in binocular rivalry, or in visual illusions;
- Yet 150ms immediate object recognition probably does not require it.
- Authors propose thinking about neurons/local circuits as having 'job descriptions', a metaphor that couples neuroscience to human organization: who is providing feedback to the workers? Who is providing feedback as to job function? (Hinton 1995).
- Propose local subspace untangling; when this is stacked and tiled, this is sufficient for object perception.
- Indeed, modern deep convolutional networks behave this way; yet they still can't match human performance (perhaps not sparse enough, not enough representational capability)
- Cite Hinton & Salakhutdinov 2006.
- The AND-OR or conv-pooling architecture was proposed by Hubel and Wiesel back in 1962! In the present paper's formulation, it is called a Normalized non-linear model, NLN.
- Nonlinearities tend to flatten object manifolds; even with random weights, NLN models tend to produce easier to decode object identities, based on strength of normalization. See also {714}.
- NLNs are tuned / become tuned to the statistics of real images. But they do not get into discrimination / perception thereof..
- NLNs learn temporally: inputs that occur temporally adjacent lead to similar responses.
- But: saccades? Humans saccade 100 million times per year!
- This could be seen as a continuity prior: the world is unlikely to change between saccades, so one can infer the identity and positions of objects on the retina, which can then be used to tune different retinotopic IT neurons.
- See Li & DiCarlo -- manipulation of image statistics changing visual responses.
- Regarding (3) above, perhaps attention is a modifier / learning gate?
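
The NLN idea from the notes above (filter, nonlinearity, normalization, with random weights) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's model: the half-wave rectification, the divisive normalization by the pooled activity of the stage, the constant `sigma`, the Gaussian random weights, and the name `nln_stage` are all choices made here.

```python
import math
import random


def nln_stage(x, weights, sigma=1.0):
    """One filter -> rectify -> divisively normalize stage (assumed form)."""
    # Linear step: each unit takes a weighted sum of the inputs.
    drive = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    # Nonlinearity: half-wave rectification (threshold at zero).
    rect = [max(0.0, d) for d in drive]
    # Normalization: divide by the pooled activity of all units in the stage.
    pool = math.sqrt(sigma ** 2 + sum(r * r for r in rect))
    return [r / pool for r in rect]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_in, n_out = 8, 16
# Random, untrained weights, as in the random-weight remark in the notes.
weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

x = [rng.uniform(0.0, 1.0) for _ in range(n_in)]
y = nln_stage(x, weights)
# The same input pattern at 10x the intensity (e.g. higher contrast).
y_bright = nln_stage([10.0 * xi for xi in x], weights)
```

Because of the division, the output vector always has length below 1 regardless of input intensity, so `y` and `y_bright` point in the same direction and differ only in a bounded gain. Stacking such stages would mean feeding `y` into another `nln_stage` with its own weights.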