Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures
 Sergey Bartunov, Adam Santoro, Blake A. Richards, Luke Marris, Geoffrey E. Hinton, Timothy Lillicrap
As is known, many algorithms work well on MNIST but fail on more complicated tasks, like CIFAR and ImageNet.
 In their experiments, backprop still fares better than any of the biologically inspired / biologically plausible learning rules. This includes:
Feedback alignment
 Vanilla target propagation
Problem: with convergent networks, layer inverses (top-down) will map all items of the same class to one target vector in each layer, which is very limiting.
 Hence this algorithm was not directly investigated.
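As a reminder of the first rule above: feedback alignment replaces the transpose of the forward weights in the backward pass with a fixed random matrix. A minimal NumPy sketch; the one-hidden-layer architecture, tanh/linear choices, sizes, and learning rate are all illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for a one-hidden-layer network; names are illustrative.
n_in, n_hid, n_out = 4, 8, 3
W1 = rng.normal(scale=0.1, size=(n_hid, n_in))   # forward weights, trained
W2 = rng.normal(scale=0.1, size=(n_out, n_hid))  # forward weights, trained
B2 = rng.normal(scale=0.1, size=(n_hid, n_out))  # fixed random feedback, never updated

x = rng.normal(size=(n_in,))
t = np.eye(n_out)[0]                             # one-hot target

# Forward pass
a1 = W1 @ x
h1 = np.tanh(a1)
y = W2 @ h1                                      # linear readout

# Error at the output
e = y - t

# Backprop would compute the hidden error as (W2.T @ e) * tanh'(a1);
# feedback alignment substitutes the fixed random matrix B2 for W2.T.
delta1 = (B2 @ e) * (1.0 - np.tanh(a1) ** 2)

lr = 0.05
W2 -= lr * np.outer(e, h1)
W1 -= lr * np.outer(delta1, x)
```

The forward weights still get outer-product updates as in backprop; only the error-routing pathway differs, which is what makes the rule more biologically plausible (no weight transport).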
 Difference target propagation (2015)
Uses the per-layer target $\hat{h}_l = g(\hat{h}_{l+1}; \lambda_{l+1}) + [h_l - g(h_{l+1};\lambda_{l+1})]$
 Or: $\hat{h}_l = h_l + g(\hat{h}_{l+1}; \lambda_{l+1}) - g(h_{l+1};\lambda_{l+1})$, where $\lambda_{l+1}$ are the parameters of the inverse model and $g(\cdot)$ is an affine transform followed by a nonlinearity.
That is, the target is modified, à la the delta rule, by the difference between the inverse-propagated higher-layer target and the inverse-propagated higher-layer activity.
 Why? $h_{l}$ should approach $\hat{h}_{l}$ as $h_{l+1}$ approaches $\hat{h}_{l+1}$ .
 Otherwise, the parameters in lower layers continue to be updated even when low loss is reached in the upper layers. (from original paper).
The weights from the penultimate layer to the last layer are trained via backprop, to prevent the target-impoverishment problem noted above.
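The target computation above can be sketched end to end. A minimal NumPy sketch; the layer sizes, the tanh nonlinearity for both forward and inverse functions, and all variable names are my assumptions, and a simple nudge toward the label stands in for the gradient-based top-layer target:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(h, W):
    """Forward function: affine map plus nonlinearity."""
    return np.tanh(W @ h)

def g(h, V):
    """Learned (approximate) inverse, same functional form."""
    return np.tanh(V @ h)

# Hypothetical layer sizes; Ws are forward weights, Vs are inverse weights.
sizes = [4, 6, 6, 3]
Ws = [rng.normal(scale=0.5, size=(sizes[i + 1], sizes[i])) for i in range(3)]
Vs = [rng.normal(scale=0.5, size=(sizes[i], sizes[i + 1])) for i in range(3)]

x = rng.normal(size=(sizes[0],))
hs = [x]
for W in Ws:
    hs.append(f(hs[-1], W))

# Top-layer target: nudge the output toward the label (stand-in for a
# gradient step on the output loss).
t = np.zeros(sizes[-1]); t[0] = 1.0
targets = [None] * len(hs)
targets[-1] = hs[-1] - 0.1 * (hs[-1] - t)

# Difference target propagation:
#   h_hat_l = h_l + g(h_hat_{l+1}) - g(h_{l+1})
for l in range(len(Ws) - 1, 0, -1):
    targets[l] = hs[l] + g(targets[l + 1], Vs[l]) - g(hs[l + 1], Vs[l])

# Each layer would then locally minimize ||f(h_{l-1}, W_{l-1}) - h_hat_l||^2.
```

Note the fixed-point property from the text: if the top target equals the top activity, the correction term vanishes at every layer and each $\hat{h}_l$ collapses to $h_l$, so lower layers stop updating once the upper layers have low loss.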
 Simplified difference target propagation
They substitute a biologically plausible learning rule for the penultimate layer:
$\hat{h}_{L-1} = h_{L-1} + g(\hat{h}_L;\lambda_L) - g(h_L;\lambda_L)$, where there are $L$ layers.
 It's the same rule as the other layers.
Hence it is subject to the impoverishment problem with low-entropy labels.
 Auxiliary output simplified difference target propagation
 Add a vector $z$ to the last layer activation, which carries information about the input vector.
$z$ is just a set of random features taken from the activation $h_{L-1}$.
They used both fully connected and locally connected (i.e., convolution without weight sharing) architectures.
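The auxiliary-output idea can be sketched as follows. A minimal NumPy sketch; the sizes, the choice of a random subset of penultimate units as the features, and all variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: h_pen is the penultimate activation, y the class logits.
n_pen, n_out, n_aux = 32, 10, 16
h_pen = rng.normal(size=(n_pen,))
W_out = rng.normal(scale=0.1, size=(n_out, n_pen))
y = W_out @ h_pen

# Auxiliary output: z is a fixed random subset of the penultimate units,
# chosen once at initialization and concatenated onto the output. The
# top-layer target then carries information about the input, not just
# the (low-entropy) label, mitigating the impoverishment problem.
idx = rng.choice(n_pen, size=n_aux, replace=False)
z = h_pen[idx]
output = np.concatenate([y, z])   # shape (n_out + n_aux,)
```

Because $z$ varies with the input even when the label does not, the inverse models no longer collapse all same-class inputs to a single per-layer target.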

 It's not so great:

Target propagation seems like a weak learner, worse than feedback alignment; not only is the feedback pathway limited, but it also fails to take advantage of the statistics of the input.
 Hence, some of these schemes may work better when combined with unsupervised learning rules.
Still, in the original paper they use difference target propagation with autoencoders, and get reasonable stroke features.
Their general conclusion, that networks and learning rules need to be tested on more difficult tasks, rings true, and might well be the main point of this otherwise meh paper.
