Conducting credit assignment by aligning local distributed representations
 Alexander G. Ororbia, Ankur Mali, Daniel Kifer, C. Lee Giles
 Propose two related algorithms under the name Local Representation Alignment (LRA): LRA-diff and LRA-fdbk.
 LRA-diff is basically a modified form of backprop.
 LRA-fdbk is a modified version of feedback alignment.
 Test on MNIST (easy: many digits can be discriminated with one pixel!) and Fashion-MNIST (harder: humans only get about 85% right!)
 Use a Cauchy or log-penalty loss at each layer, which is somewhat unique and interesting: $L(z,y) = \sum_{i=1}^n \log(1 + (y_i - z_i)^2)$.
 This is hence a saturating loss.
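A minimal numpy sketch of this loss and its gradient (my own illustration, not the paper's code); the gradient magnitude is bounded, which is exactly what makes the loss saturating:

```python
import numpy as np

def cauchy_loss(z, y):
    """Log-penalty (Cauchy) loss: sum_i log(1 + (y_i - z_i)^2)."""
    d = y - z
    return np.sum(np.log1p(d ** 2))

def cauchy_loss_grad(z, y):
    """Gradient w.r.t. z: -2(y - z) / (1 + (y - z)^2), bounded in [-1, 1]."""
    d = y - z
    return -2.0 * d / (1.0 + d ** 2)
```

Note that for large errors the gradient shrinks back toward zero (maximum magnitude is at $|y_i - z_i| = 1$), unlike squared error, whose gradient grows without bound.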

 Normal multilayer-perceptron feedforward network. Pre-activation $h^\ell$ and post-activation $z^\ell$ are stored.
 Update the weights to minimize the loss. This gradient calculation is identical to backprop, only they constrain the update to have a norm no bigger than $c_1$. $z$ and $y$ are the actual and desired outputs of the layer, as commented. The gradient includes the derivative of the nonlinear activation function.
 Generate an update for the pre-nonlinearity $h^{\ell-1}$ to minimize the loss in the layer above. This again is very similar to backprop; it's the chain rule, but the derivatives are vectors, of course, so those should be elementwise multiplications, not outer products (I think).
 Note the $h$ update passes through the derivatives of two nonlinearities.
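As a concrete reading of the above, a sketch of the target step for $h^{\ell-1}$ (my own naming and a tanh nonlinearity; `beta` is an assumed step size, not taken from the paper):

```python
import numpy as np

def dtanh(h):
    """Derivative of tanh, applied elementwise."""
    return 1.0 - np.tanh(h) ** 2

def lra_diff_target(W, h_prev, h, z, y, beta=0.001):
    """Target for the lower layer's pre-activation h^{l-1} (my sketch).

    Assumes z = tanh(h), h = W @ tanh(h_prev), and the log-penalty loss
    L(z, y) = sum log(1 + (y - z)^2). The chain rule passes through TWO
    nonlinearity derivatives, via elementwise products, not outer products.
    """
    e = -2.0 * (y - z) / (1.0 + (y - z) ** 2)    # dL/dz
    delta = e * dtanh(h)                          # dL/dh (elementwise)
    grad_h_prev = (W.T @ delta) * dtanh(h_prev)   # dL/dh^{l-1}
    return h_prev - beta * grad_h_prev            # nudge toward lower loss
```

Moving $h^{\ell-1}$ toward this target reduces the layer-$\ell$ loss for a small enough step, since it is just a gradient step on the local loss.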
 Feedback-alignment version, with random matrix $E_{\ell}$ (elements drawn from a Gaussian distribution, $\sigma = 1$ ish).
 Only one nonlinearity derivative here - bug?
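A matching sketch of the feedback-alignment variant (same caveats: my naming, tanh units, assumed step size). Here $W^\top$ is replaced by a fixed random matrix $E$, and only one nonlinearity derivative appears, per the note above:

```python
import numpy as np

def lra_fdbk_target(E, h_prev, h, z, y, beta=0.1):
    """LRA-fdbk-style target (my sketch): W.T is replaced by a fixed random
    Gaussian matrix E, and the dtanh(h_prev) factor is omitted, matching the
    single nonlinearity derivative in the paper's update."""
    e = -2.0 * (y - z) / (1.0 + (y - z) ** 2)  # dL/dz for the log-penalty loss
    delta = e * (1.0 - np.tanh(h) ** 2)         # dL/dh
    return h_prev - beta * (E @ delta)          # random feedback instead of W.T
```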
 Move the representations and post-activations in the specified gradient direction.
 Those $\bar{h}^{\ell-1}$ variables are temporary holding variables; but note that both lower and higher layers are updated.
 Do this K times, K=150.
 In practice K=1 with the LRA-fdbk algorithm for the majority of the paper; it works much better than LRA-diff (interesting... bug?). Hence, this basically reduces to feedback alignment.
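The loop structure is just repeated refinement (a trivial hypothetical helper; `step_fn` stands in for one target-update step):

```python
def refine(h, step_fn, K=150):
    """Apply one target-update step K times (the paper uses K=150)."""
    for _ in range(K):
        h = step_fn(h)
    return h
```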
 Demonstrate that LRA works much better with small initial weights, but basically because they tweak the algorithm to do this.
 Need to see a positive control for this to be conclusive.
 Again, why is FA so different from LRAfdbk? Suspicious. Positive controls.
 Attempted a network with Local Winner-Take-All (LWTA), which is a hard nonlinearity that LRA was able to account for and train through.
 Also used Bernoulli neurons, and were able to successfully train. Unlike dropout, these were stochastic at test time, and things still worked OK.
Lit review.
 Logistic sigmoid can slow down learning, due to its nonzero mean (Glorot & Bengio 2010).
 Recirculation algorithm (or generalized recirculation) is a precursor for target propagation.
 Target propagation is all about the inverse of the forward propagation: if we had access to the inverse of the network of forward propagations, we could compute which input values at the lower levels of the network would result in better values at the top that would please the global cost.
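A toy illustration of that inverse view (my sketch: a single tanh "layer" with its exact inverse; real target propagation learns an approximate inverse, and weights are omitted here):

```python
import numpy as np

f = np.tanh      # forward map of one layer (weights omitted for simplicity)
g = np.arctanh   # exact inverse of tanh on (-1, 1); target prop approximates this

h = np.array([0.2, -0.5])
z = f(h)
z_target = z + 0.05       # a slightly "better" output for the layer above
h_target = g(z_target)    # which lower-level value would have produced it
```

So instead of pushing errors down, you pull targets down through (approximate) inverses.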
 This is a very different way of looking at it  almost backwards!
 And indeed, it's not really all that different from contrastive divergence (even though CD doesn't work well with non-Bernoulli units).
 Contrastive Hebbian learning also has two phases, one to fantasize, and one to try to make the fantasies look more like the input data.
 Decoupled neural interfaces (Jaderberg et al 2016): learn a predictive model of error gradients (and inputs) instead of trying to use local information to estimate updated weights.
 Yeah, call me a critic, but I'm not clear on the contribution of this paper; it smells precocious and oversold.
 Even the title. I was hoping for something more 'local' than perlayer computation. BP does that already!
 They primarily report supportive tests, not discriminative or stressing tests; how does the algorithm fail?
 Certainly a lot of work went into it..
 I still don't see how computing a target through a random matrix, then using the delta/loss/error between that target and the feedforward activation to update weights, is much different than propagating the errors directly through a random feedback matrix. E.g. subtract then multiply, or multiply then subtract?
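A toy linear check of that suspicion (my own, ignoring the nonlinearity derivatives): forming a target through a random matrix and then taking the delta to it yields exactly the error that plain FA would have propagated directly.

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.normal(size=(4, 3))   # fixed random feedback matrix
h = rng.normal(size=4)        # feedforward pre-activation
delta = rng.normal(size=3)    # error signal from the layer above
beta = 0.1

# "Multiply then subtract": form a target, then take the delta to it.
target = h - beta * (E @ delta)
local_error = h - target

# "Subtract then multiply" (plain FA): propagate the error directly.
fa_error = beta * (E @ delta)
```

In this linear setting `local_error` and `fa_error` are identical; any difference between LRA-fdbk and FA must come from where the nonlinearity derivatives and the local loss are applied.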
