use https for features.
text: sort by
tags: modified
type: chronology
hide / / print
ref: -2017 tags: schema networks reinforcement learning atari breakout vicarious date: 09-29-2020 02:32 gmt revision:2 [1] [0] [head]

Schema networks: zero-shot transfer with a generative causal model of intuitive physics

  • Like a lot of papers, the title has more flash than the actual results.
  • Results which would be state of the art (as of 2017) in playing Atari breakout, then transferring performance to modifications of the game (paddle moved up a bit, wall added in the middle of the bricks, brick respawning, juggling).
  • Schema network is based on 'entities' (objects) which have binary 'attributes'. These attributes can include continuous-valued signals, in which case each binary variable is like a place fields (i think).
    • This is clever an interesting -- rather than just low-level features pointing to high-level features, this means that high-level entities can have records of low-level features -- an arrow pointing in the opposite direction, one which can (also) be learned.
    • The same idea is present in other Vicarious work, including the CAPTCHA paper and more-recent (and less good) Bio-RNN paper.
  • Entities and attributes are propagated forward in time based on 'ungrounded schemas' -- basically free-floating transition matrices. The grounded schemas are entities and action groups that have evidence in observation.
    • There doesn't seem to be much math describing exactly how this works; only exposition. Or maybe it's all hand-waving over the actual, much simpler math.
      • Get the impression that the authors are reaching to a level of formalism when in fact they just made something that works for the breakout task... I infer Dileep prefers the empirical for the formal, so this is likely primarily the first author.
  • There are no perceptual modules here -- game state is fed to the network directly as entities and attributes (and, to be fair, to the A3C model).
  • Entity-attributes vectors are concatenated into a column vector length NTNT , where NN are the number of entities, and TT are time slices.
    • For each entity of N over time T, a row-vector is made of length MRMR , where MM are the number of attributes (fixed per task) and R1R-1 are the number of neighbors in a fixed radius. That is, each entity is related to its neighbors attributes over time.
    • This is a (large, sparse) binary matrix, XX .
  • yy is the vector of actions; task is to predict actions from XX .
    • How is X learned?? Very unclear in the paper vs. figure 2.
  • The solution is approximated as y=XW1¯y = X W \bar{1 } where WW is a binary weight matrix.
    • Minimize the solution based on an objective function on the error and the complexity of ww .
    • This is found via linear programming relaxation. "This procedure monotonically decreases the prediction error of the overall schema network, while increasing its complexity".
      • As it's a issue of binary conjunctions, this seems like a SAT problem!
    • Note that it's not probabilistic: "For this algorithm to work, no contradictions can exist in the input data" -- they instead remove them!
  • Actual behavior includes maximum-product belief propagation, to look for series of transitions that set the reward variable without setting the fail variable.
    • Because the network is loopy, this has to occur several times to set entity variables eg & includes backtracking.

  • Have there been any further papers exploring schema networks? What happened to this?
  • The later paper from Vicarious on zero-shot task transfer are rather less interesting (to me) than this.

hide / / print
ref: -2017 tags: vicarious dileep george captcha message passing inference heuristic network date: 03-06-2019 04:31 gmt revision:2 [1] [0] [head]

PMID-29074582 A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs

  • Vicarious supplementary materials on their RCN (recursive cortical network).
  • Factor scene into shape and appearance, which CNN or DCNN do not do -- they conflate (ish? what about the style networks?)
    • They call this the coloring book approach -- extract shape then attach appearance.
  • Hierarchy of feature layers F frcF_{f r c} (binary) and pooling layer H frcH_{f r c} (multinomial), where f is feature, r is row, c is column (e.g. over image space).
  • Each layer is exclusively conditional on the layer above it, and all features in a layer are conditionally independent given the layer above.
  • Pool variables H frcH_{f r c} is multinomial, and each value associated with a feature, plus one off feature.
    • These features form a ‘pool’, which can/does have translation invariance.
  • If any of the pool variables are set to enable FF , then that feature is set (or-operation). Many pools can contain a given feature.
  • One can think of members of a pool as different alternatives of similar features.
  • Pools can be connected laterally, so each is dependent on the activity of its neighbors. This can be used to enforce edge continuity.
  • Each bottom-level feature corresponds to an edge, which defines ‘in’ and ‘out’ to define shape, YY .
  • These variables YY are also interconnected, and form a conditional random field, a ‘Potts model’. YY is generated by gibbs sampling given the F-H hierarchy above it.
  • Below Y, the per-pixel model X specifies texture with some conditional radial dependence.
  • The model amounts to a probabalistic model for which exact inference is impossible -- hence you must do approximate, where a bottom up pass estimates the category (with lateral connections turned off), and a top down estimates the object mask. Multiple passes can be done for multiple objects.
  • Model has a hard time moving from rgb pixels to edge ‘in’ and ‘out’; they use edge detection pre-processing stage, e.g. Gabor filter.
  • Training follows a very intuitive, hierarchical feature building heuristic, where if some object or collection of lower level features is not present, it’s added to the feature-pool tree.
    • This includes some winner-take-all heuristic for sparsification.
    • Also greedily learn some sort of feature ‘’dictionary’’ from individual unlabeled images.
  • Lateral connections are learned similarly, with a quasi-hebbian heuristic.
  • Neuroscience inspiration: see refs 9, 98 for message-passing based Bayesian inference.

  • Overall, a very heuristic, detail-centric, iteratively generated model and set of algorithms. You get the sense that this was really the work of Dileep George or only a few people; that it was generated by successively patching and improving the model/algo to make up for observed failures and problems.
    • As such, it offers little long-term vision for what is possible, or how perception and cognition occurs.
    • Instead, proof is shown that, well, engineering works, and the space of possible solutions -- including relatively simple elements like dictionaries and WTA -- is large and fecund.
      • Unclear how this will scale to even more complex real-world problems, where one would desire a solution that does not have to have each level carefully engineered.
      • Modern DCNN, at least, do not seem to have this property -- the structure is learned from the (alas, labeled) data.
  • This extends to the fact that yes, their purpose-built system achieves state of the art performance on the designated CAPATCHA tasks.
  • Check: B. M. Lake, R. Salakhutdinov, J. B. Tenenbaum, Human-level concept learning through probabilistic program induction. Science 350, 1332–1338 (2015). doi:10.1126/science.aab3050 Medline