{1446} revision 2 modified: 03-06-2019 04:31 gmt

PMID-29074582 A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs

  • Vicarious supplementary materials on their RCN (recursive cortical network).
  • Factors the scene into shape and appearance, which CNNs / DCNNs do not do -- they conflate the two (ish? what about the style-transfer networks?)
    • They call this the coloring book approach -- extract shape then attach appearance.
  • Hierarchy of feature layers F_{frc} (binary) and pooling layers H_{frc} (multinomial), where f is feature, r is row, c is column (e.g. over image space).
  • Each layer is exclusively conditional on the layer above it, and all features in a layer are conditionally independent given the layer above.
  • Pool variables H_{frc} are multinomial; each value is associated with a feature, plus one ‘off’ state.
    • These features form a ‘pool’, which can/does have translation invariance.
  • If any of the pool variables are set to enable F, then that feature is set (an OR operation; toy sampling sketch after this list). Many pools can contain a given feature.
  • One can think of members of a pool as different alternatives of similar features.
  • Pools can be connected laterally, so each is dependent on the activity of its neighbors. This can be used to enforce edge continuity.
  • Each bottom-level feature corresponds to an edge; edges define ‘in’ and ‘out’, which in turn define the shape variables Y.
  • These variables Y are also interconnected and form a conditional random field, a ‘Potts model’. Y is generated by Gibbs sampling given the F-H hierarchy above it (toy sketch after this list).
  • Below Y, the per-pixel model X specifies texture with some conditional radial dependence.
  • The model amounts to a probabilistic model for which exact inference is intractable -- hence one must do approximate inference, where a bottom-up pass estimates the category (with lateral connections turned off), and a top-down pass estimates the object mask (caricature after this list). Multiple passes can be done for multiple objects.
  • The model has a hard time moving from RGB pixels to edge ‘in’ and ‘out’; they use an edge-detection pre-processing stage, e.g. a Gabor filter bank (sketch after this list).
  • Training follows a very intuitive, hierarchical feature-building heuristic: if some object or collection of lower-level features is not already represented, it’s added to the feature-pool tree.
    • This includes some winner-take-all heuristic for sparsification.
    • Also greedily learns some sort of feature ‘dictionary’ from individual unlabeled images (toy sketch after this list).
  • Lateral connections are learned similarly, with a quasi-Hebbian heuristic (sketch after this list).
  • Neuroscience inspiration: see refs 9, 98 for message-passing based Bayesian inference.
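
A minimal numpy sketch of how I read the F/H structure (my own toy reconstruction, not Vicarious's code; the pool contents, the p_off probability, and the array shapes are all made up): each active parent feature F_{frc} owns a pool H_{frc} whose multinomial value selects one translated child feature or ‘off’, and a child feature turns ON if any parent pool selects it (the OR).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_child_layer(F_parent, pool_members, p_off=0.1):
    """Sample one level downward.
    F_parent: binary array (n_features, rows, cols), the active features F_{frc}.
    pool_members: dict f -> list of (child_feature, d_row, d_col) alternatives,
                  i.e. the translation-invariant pool owned by feature f.
    Each active parent draws its pool variable H_{frc} (multinomial over the
    alternatives plus an 'off' state); children are OR-ed together."""
    nf, R, C = F_parent.shape
    n_child = 1 + max(cf for alts in pool_members.values() for cf, _, _ in alts)
    F_child = np.zeros((n_child, R, C), dtype=bool)
    for f, r, c in zip(*np.nonzero(F_parent)):
        if rng.random() < p_off:                      # the extra 'off' value
            continue
        alts = pool_members[f]
        cf, dr, dc = alts[rng.integers(len(alts))]    # multinomial pool choice
        rr, cc = r + dr, c + dc
        if 0 <= rr < R and 0 <= cc < C:
            F_child[cf, rr, cc] = True                # OR: many pools may set this
    return F_child

# toy usage: one parent feature whose pool holds horizontal jitters of child 0
pools = {0: [(0, 0, -1), (0, 0, 0), (0, 0, 1)]}
F_top = np.zeros((1, 5, 5), dtype=bool); F_top[0, 2, 2] = True
print(sample_child_layer(F_top, pools).astype(int)[0])
```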
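
The Potts-style CRF over the in/out variables Y, sampled with single-site Gibbs updates. This is the generic textbook version, not the paper's exact factor graph; the unary potentials, the coupling beta, and the sweep count are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_potts(unary, beta=1.5, sweeps=20):
    """unary: (H, W, K) log-potentials per pixel (stand-in for the evidence
    coming down from the edge/feature layers).  beta is a Potts coupling that
    rewards neighbouring pixels for taking the same label.  Returns an (H, W)
    label map after single-site Gibbs sweeps."""
    H, W, K = unary.shape
    Y = rng.integers(K, size=(H, W))
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                logp = unary[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        logp[Y[ni, nj]] += beta        # Potts: reward agreement
                p = np.exp(logp - logp.max())
                Y[i, j] = rng.choice(K, p=p / p.sum())
    return Y

# toy usage: K=2 labels ('out'=0, 'in'=1), weak evidence for 'in' in the centre
unary = np.zeros((8, 8, 2)); unary[2:6, 2:6, 1] = 0.5
print(gibbs_potts(unary))
```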
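
A caricature of the bottom-up/top-down split in inference: score each category template against the edge maps with small pooling windows (laterals off), then read the winning window positions back out as a crude top-down mask. The templates, pooling radius, and scoring rule here are invented for illustration; the paper does proper max-product message passing.

```python
import numpy as np

def bottom_up_score(edge_maps, template, pool_rad=2):
    """Forward pass with laterals off: each template entry (f, r, c) looks for
    evidence of feature f anywhere in a small pooling window around (r, c) and
    max-pools it; the category score is the summed pooled evidence.  The
    winning positions double as a crude top-down object mask."""
    nf, H, W = edge_maps.shape
    score, mask = 0.0, np.zeros((H, W), dtype=bool)
    for f, r, c in template:
        r0, r1 = max(r - pool_rad, 0), min(r + pool_rad + 1, H)
        c0, c1 = max(c - pool_rad, 0), min(c + pool_rad + 1, W)
        window = edge_maps[f, r0:r1, c0:c1]
        score += window.max()
        wr, wc = np.unravel_index(window.argmax(), window.shape)
        mask[r0 + wr, c0 + wc] = True          # 'top-down': remember who won
    return score, mask

# toy usage: two hypothetical category templates scored on random edge evidence
rng = np.random.default_rng(4)
edges = rng.random((4, 20, 20))
tmpl_a = [(0, 5, 5), (1, 5, 14), (2, 14, 5), (3, 14, 14)]
tmpl_b = [(0, 2, 2), (1, 2, 4), (2, 4, 2), (3, 4, 4)]
print({name: round(bottom_up_score(edges, t)[0], 2)
       for name, t in (('a', tmpl_a), ('b', tmpl_b))})
```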
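
The Gabor-filter pre-processing front end, roughly: an oriented bank of even-symmetric Gabor kernels, convolved and rectified. Wavelength, sigma, kernel size, and the number of orientations are my guesses, not the paper's settings.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, lam=6.0, sigma=2.5, size=11):
    """Even-symmetric Gabor: a sinusoidal grating under a Gaussian envelope,
    rotated to orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * xr / lam)

def edge_responses(gray, n_orient=8):
    """Rectified filter-bank responses, one map per orientation."""
    thetas = np.linspace(0, np.pi, n_orient, endpoint=False)
    return np.stack([np.maximum(convolve2d(gray, gabor_kernel(t), mode='same'), 0)
                     for t in thetas])

# toy usage: per-orientation rectified response energy to a vertical step edge
img = np.zeros((32, 32)); img[:, 16:] = 1.0
resp = edge_responses(img)
print(resp.shape, resp.reshape(len(resp), -1).sum(axis=1).round(1))
```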
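
A caricature of the greedy dictionary-plus-WTA heuristic: a patch either wins a match against an existing atom (winner-take-all: only the best match counts) or, if nothing is similar enough, becomes a new dictionary entry. The similarity measure and threshold are made up, and the real procedure builds hierarchical feature-pool trees rather than flat patch atoms.

```python
import numpy as np

def greedy_patch_dictionary(patches, match_thresh=0.85, max_atoms=64):
    """One-pass greedy dictionary building: each (normalised) patch either wins
    a match with an existing atom -- winner-take-all, only the best match
    counts -- or, if no atom is similar enough, is added as a new atom."""
    atoms = []
    for p in patches:
        v = p.ravel().astype(float)
        norm = np.linalg.norm(v)
        if norm == 0:
            continue
        v /= norm
        if atoms and max(a @ v for a in atoms) >= match_thresh:
            continue                            # WTA: best existing atom claims it
        if len(atoms) < max_atoms:
            atoms.append(v)                     # unexplained patch -> new entry
    return np.array(atoms)

# toy usage: 5x5 patches tiled from a single unlabeled binary 'image'
rng = np.random.default_rng(2)
img = (rng.random((40, 40)) > 0.7).astype(float)
patches = [img[i:i + 5, j:j + 5] for i in range(0, 35, 5) for j in range(0, 35, 5)]
print(greedy_patch_dictionary(patches).shape)
```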
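
My guess at the flavour of the quasi-Hebbian lateral learning: count how often neighbouring pools co-fire across training images and threshold the normalised co-occurrence into lateral edges. The normalisation and threshold are invented, not taken from the paper.

```python
import numpy as np

def hebbian_lateral_edges(activations, thresh=0.8):
    """activations: (n_images, n_pools) binary matrix of which pools fired on
    each training image.  Pools that co-fire much more often than chance get a
    lateral edge -- 'fire together, wire together' counting."""
    A = activations.astype(float)
    counts = A.T @ A                              # pairwise co-activation counts
    freq = np.maximum(A.sum(axis=0), 1)           # per-pool firing counts
    cooc = counts / np.minimum.outer(freq, freq)  # normalised co-occurrence
    np.fill_diagonal(cooc, 0)
    return cooc > thresh                          # boolean lateral-connection matrix

# toy usage: pools 0 and 1 always fire together; pool 2 fires independently
rng = np.random.default_rng(3)
a01 = (rng.random(200) > 0.5).astype(int)
acts = np.stack([a01, a01, (rng.random(200) > 0.5).astype(int)], axis=1)
print(hebbian_lateral_edges(acts).astype(int))
```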

  • Overall, a very heuristic, detail-centric, iteratively generated model and set of algorithms. You get the sense that this was really the work of Dileep George and only a few others; that it was generated by successively patching and improving the model/algo to make up for observed failures and problems.
    • As such, it offers little long-term vision for what is possible, or how perception and cognition occurs.
    • Instead, proof is shown that, well, engineering works, and the space of possible solutions -- including relatively simple elements like dictionaries and WTA -- is large and fecund.
      • Unclear how this will scale to even more complex real-world problems, where one would desire a solution that does not have to have each level carefully engineered.
      • Modern DCNNs, at least, do not seem to have this property -- the structure is learned from the (alas, labeled) data.
  • This extends to the fact that, yes, their purpose-built system achieves state-of-the-art performance on the designated CAPTCHA tasks.
  • Check: B. M. Lake, R. Salakhutdinov, J. B. Tenenbaum, Human-level concept learning through probabilistic program induction. Science 350, 1332–1338 (2015). doi:10.1126/science.aab3050