{1441} revision 5 modified: 03-15-2019 05:51 gmt

Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures

  • Sergey Bartunov, Adam Santoro, Blake A. Richards, Luke Marris, Geoffrey E. Hinton, Timothy Lillicrap
  • As is known, many algorithms work well on MNIST, but fail on more complicated tasks, like CIFAR and ImageNet.
  • In their experiments, backprop still fares better than any of the biologically inspired / biologically plausible learning rules. This includes:
    • Feedback alignment {1432} {1423}
    • Vanilla target propagation
      • Problem: with convergent networks, layer inverses (top-down) will map all items of the same class to one target vector in each layer, which is very limiting.
      • Hence this algorithm was not directly investigated.
    • Difference target propagation (2015)
      • Uses the per-layer target as h^ l=g(h^ l+1;λ l+1)+[h lg(h l+1;λ l+1)]\hat{h}_l = g(\hat{h}_{l+1}; \lambda_{l+1}) + [h_l - g(h_{l+1};\lambda_{l+1})]
      • Or: h^ l=h l+g(h^ l+1;λ l+1)g(h l+1;λ l+1)\hat{h}_l = h_l + g(\hat{h}_{l+1}; \lambda_{l+1}) - g(h_{l+1};\lambda_{l+1}) where λ l\lambda_{l} are the parameters for the inverse model; g()g() is the sum and nonlinearity.
      • That is, the target is modified ala delta rule by the difference between inverse-propagated higher layer target and inverse-propagated higher level activity.
        • Why? h lh_{l} should approach h^ l\hat{h}_{l} as h l+1h_{l+1} approaches h^ l+1\hat{h}_{l+1} .
        • Otherwise, the parameters in lower layers continue to be updated even when low loss is reached in the upper layers. (from original paper).
      • The last to penultimate layer weights is trained via backprop to prevent template impoverishment as noted above.
    • Simplified difference target propagation
      • The substitute a biologically plausible learning rule for the penultimate layer,
      • h^ L1=h L1+g(h^ L;λ L)g(h L;λ L)\hat{h}_{L-1} = h_{L-1} + g(\hat{h}_L;\lambda_L) - g(h_L;\lambda_L) where there are LL layers.
      • It's the same rule as the other layers.
      • Hence subject to impoverishment problem with low-entropy labels.
    • Auxiliary output simplified difference target propagation
      • Add a vector zz to the last layer activation, which carries information about the input vector.
      • zz is just a set of random features from the activation h L1h_{L-1} .
  • Used both fully connected and locally-connected (e.g. convolution without weight sharing) MLP.
  • It's not so great:
  • Target propagation seems like a weak learner, worse than feedback alignment; not only is the feedback limited, but it does not take advantage of the statistics of the input.
    • Hence, some of these schemes may work better when combined with unsupervised learning rules.
    • Still, in the original paper they use difference-target propagation with autoencoders, and get reasonable stroke features..
  • Their general result that networks and learning rules need to be tested on more difficult tasks rings true, and might well be the main point of this otherwise meh paper.