Conducting credit assignment by aligning local distributed representations
 Alexander G. Ororbia, Ankur Mali, Daniel Kifer, C. Lee Giles
 Propose two related algorithms under the Local Representation Alignment (LRA) umbrella: LRA-diff and LRA-fdbk.
 LRA-diff is basically a modified form of backprop.
 LRA-fdbk is a modified version of feedback alignment.
 Test on MNIST (easy: many digits can be discriminated with one pixel!) and Fashion-MNIST (harder: humans only get about 85% right!)
 Use a Cauchy or log-penalty loss at each layer, which is somewhat unique and interesting: $L(z,y) = \sum_{i=1}^n \log(1 + (y_i - z_i)^2)$.
 This is hence a saturating loss.
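The saturation is easy to verify numerically. A minimal sketch (the function name `log_penalty_loss` is mine, not the paper's):

```python
import numpy as np

def log_penalty_loss(z, y):
    """Cauchy / log-penalty loss: L(z, y) = sum_i log(1 + (y_i - z_i)^2).

    Unlike squared error, each unit's penalty grows only logarithmically
    with the residual, so large errors saturate instead of dominating.
    """
    d = y - z
    return np.sum(np.log1p(d ** 2))

# Saturation: a 10x larger residual adds far less than 100x the loss.
small = log_penalty_loss(np.array([0.0]), np.array([1.0]))   # log(2)
large = log_penalty_loss(np.array([0.0]), np.array([10.0]))  # log(101)
```

Compare with squared error, where the same 10x residual would cost 100x as much.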

 Normal multilayer-perceptron feedforward network. The pre-activation $h^\ell$ and post-activation $z^\ell$ are stored.
 Update the weights to minimize loss. This gradient calculation is identical to backprop, except that they constrain the update to have a norm no bigger than $c_1$. $z$ and $y$ are the actual and desired output of the layer, as commented. The gradient includes the derivative of the nonlinear activation function.
 Generate an update for the pre-nonlinearity $h^{\ell-1}$ to minimize the loss in the layer above. This again is very similar to backprop; it's the chain rule, but the derivatives are vectors, of course, so those should be element-wise multiplications, not outer products (I think).
 Note $h$ is updated using the derivatives of two nonlinearities.
 Feedback-alignment version, with random matrix $E_{\ell}$ (elements drawn from a Gaussian distribution, $\sigma = 1$ or so).
 Only one nonlinearity derivative here; a bug?
 Move the representation and post-activations in the specified gradient direction.
 Those $\bar{h}^{\ell-1}$ variables are temporary holding variables; note that both lower and higher layers are updated.
 Do this $K$ times, $K = 150$.
 In practice $K = 1$ with the LRA-fdbk algorithm for the majority of the paper; it works much better than LRA-diff (interesting... bug?). Hence, this basically reduces to feedback alignment.
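A minimal sketch of what one LRA-fdbk-style step might look like for a single hidden layer. This is my reconstruction under stated assumptions (tanh units, a fixed Gaussian feedback matrix `E2`, the log-penalty loss at the top, and step sizes of my choosing), not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):   # tanh nonlinearity (assumed)
    return np.tanh(x)

def df(x):  # derivative of tanh w.r.t. its pre-activation
    return 1.0 - np.tanh(x) ** 2

n_in, n_hid, n_out = 4, 5, 3
W1 = rng.normal(scale=0.1, size=(n_hid, n_in))
W2 = rng.normal(scale=0.1, size=(n_out, n_hid))
E2 = rng.normal(scale=1.0, size=(n_hid, n_out))  # fixed random feedback

x = rng.normal(size=n_in)
y = np.array([1.0, 0.0, 0.0])

# Forward pass: store pre-activations h and post-activations z.
h1 = W1 @ x;  z1 = f(h1)
h2 = W2 @ z1; z2 = f(h2)

# Top-layer error under the log-penalty loss: dL/dz = -2(y - z)/(1 + (y - z)^2).
e2 = -2 * (y - z2) / (1 + (y - z2) ** 2)

# Target for the lower layer: move h1 through the random feedback matrix.
beta = 0.1
h1_target = h1 - beta * (E2 @ (e2 * df(h2)))
z1_target = f(h1_target)

# Local weight updates from the layer-wise mismatches.
lr = 0.01
W2 -= lr * np.outer(e2 * df(h2), z1)
W1 -= lr * np.outer((z1 - z1_target) * df(h1), x)
```

With $K = 1$, this one pass is the whole inner loop; for $K > 1$ the target computation and representation updates would repeat before the weight updates.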
 Demonstrate that LRA works much better with small initial weights, but basically because they tweak the algorithm to do this.
 Need to see a positive control for this to be conclusive.
 Again, why is FA so different from LRA-fdbk? Suspicious. Positive controls needed.
 Attempted a network with Local Winner-Take-All (LWTA), which is a hard nonlinearity that LRA was able to account for & train through.
 Also used Bernoulli neurons, and were able to successfully train. Unlike dropout, these were stochastic at test time, and things still worked OK.
Lit review.
 Logistic sigmoid can slow down learning, due to its non-zero mean (Glorot & Bengio 2010).
 Recirculation algorithm (or generalized recirculation) is a precursor for target propagation.
 Target propagation is all about the inverse of the forward propagation: if we had access to the inverse of the network of forward propagations, we could compute which input values at the lower levels of the network would result in better values at the top that would please the global cost.
 This is a very different way of looking at it  almost backwards!
 And indeed, it's not really all that different from contrastive divergence (even though CD doesn't work well with non-Bernoulli units).
 Contrastive Hebbian learning also has two phases, one to fantasize, and one to try to make the fantasies look more like the input data.
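The target-propagation idea a few bullets up can be made concrete with an exactly invertible linear layer. A toy sketch (real target prop uses a *learned* approximate inverse, and all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 3
W = rng.normal(scale=0.5, size=(N, N)) + 2 * np.eye(N)  # keep W invertible

def f(h):                                  # forward layer
    return W @ h

def f_inv(z):                              # its exact inverse
    return np.linalg.solve(W, z)

h = rng.normal(size=N)
z = f(h)
y = rng.normal(size=N)                     # output that would please the cost

# Propagate targets, not gradients: nudge the output toward y,
# then invert to find the input value that would have produced it.
z_target = z + 0.1 * (y - z)
h_target = f_inv(z_target)

# Moving h to h_target moves z to z_target exactly (the layer is linear).
assert np.allclose(f(h_target), z_target)
```

This is the "almost backwards" view: instead of asking how the output changes with the input, ask which input would have produced a better output.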
 Decoupled neural interfaces (Jaderberg et al. 2016): learn a predictive model of error gradients (and inputs) instead of trying to use local information to estimate updated weights.
 Yeah, call me a critic, but I'm not clear on the contribution of this paper; it smells precocious and oversold.
 Even the title. I was hoping for something more 'local' than per-layer computation. BP does that already!
 They primarily report supportive tests, not discriminative or stressing tests; how does the algorithm fail?
 Certainly a lot of work went into it..
 I still don't see how computing a target through a random matrix, then using the delta/loss/error between that target and the feedforward activation to update weights, is much different from propagating the errors directly through a random feedback matrix. E.g., subtract then multiply, or multiply then subtract?
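The question can be checked numerically: in the linear case the two orderings are identical up to the step size, and a nonlinearity only separates them at second order. A sketch (variable names mine):

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.normal(size=(5, 3))   # random feedback matrix
e = rng.normal(size=3)        # error at the layer above
h = rng.normal(size=5)        # pre-activation of the lower layer
beta = 0.1

# "Multiply then subtract": push the error straight through E (FA-style).
fa_signal = beta * (E @ e)

# "Subtract then multiply": form a target via E, then take the delta.
h_target = h - beta * (E @ e)
lra_delta = h - h_target
assert np.allclose(fa_signal, lra_delta)  # identical in the linear case

# With a nonlinearity applied before taking the delta, the two only
# agree to first order in beta.
post_delta = np.tanh(h) - np.tanh(h_target)
fa_post = (1 - np.tanh(h) ** 2) * fa_signal  # first-order approximation
```

So the suspicion seems right: for small $\beta$ and $K = 1$, the target-then-delta scheme is feedback alignment plus an $O(\beta^2)$ correction from the nonlinearity.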

PMID22325196 Backpropagation through time and the brain
 Timothy Lillicrap and Adam Santoro
 Backpropagation through time: the 'canonical' expansion of backprop to assign credit in recurrent neural networks used in machine learning.
 E.g. variable rollouts, where the error is propagated many times through the recurrent weight matrix, $W^T$.
 This leads to the exploding or vanishing gradient problem.
 TCA = temporal credit assignment. What led to this reward or error? How to affect memory to encourage or avoid this?
 One approach is to simply truncate the error: truncated backpropagation through time (TBPTT). But this of course limits the horizon of learning.
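The truncation is easy to sketch for a minimal linear RNN $h_t = W h_{t-1} + x_t$ (sizes and names are mine): the error at the final step is carried back through $W^T$ at most $k$ times, and credit beyond that horizon is simply dropped.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, k = 4, 20, 5
W = rng.normal(scale=0.3, size=(N, N))
e = rng.normal(size=N)        # error at the final time step

def backprop_signal(W, e, steps):
    """Carry the error backwards `steps` times through W^T."""
    signals = []
    s = e
    for _ in range(steps):
        s = W.T @ s           # repeated multiplication: the source of
        signals.append(s)     # exploding / vanishing gradients
    return signals

full = backprop_signal(W, e, T)    # full BPTT: reaches t = T - 20
trunc = backprop_signal(W, e, k)   # TBPTT: stops k = 5 steps back
```

The truncated signals are exactly the first $k$ of the full ones; everything earlier is discarded, which is the limited learning horizon noted above.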
 The brain may do BPTT via replay in both the hippocampus and cortex (Nat. Neuroscience 2007), thereby alleviating the need to retain long time histories of neuron activations (needed for derivative and credit assignment).
 A less-known method of TCA uses RTRL, Real-Time Recurrent Learning, i.e. forward-mode differentiation: $\partial h_t / \partial \theta$ is computed and maintained online, often with synaptic weight updates applied at each time step in which there is nonzero error. See "A learning algorithm for continually running fully recurrent neural networks."
 Big problem: A network with $N$ recurrent units requires $O(N^3)$ storage and $O(N^4)$ computation at each timestep.
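The cost is visible in a direct sketch for a linear RNN $h_t = W h_{t-1} + x_t$ (my toy construction, not from the paper): the influence tensor $J_t = \partial h_t / \partial W$ has $N^3$ entries, and propagating it through $W$ costs $O(N^4)$ multiply-adds per step.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 4, 6
W = rng.normal(scale=0.3, size=(N, N))
h = np.zeros(N)
J = np.zeros((N, N, N))  # J[i, a, b] = dh[i] / dW[a, b]: O(N^3) storage

for t in range(T):
    x = rng.normal(size=N)
    # Carry the old influence forward through W: O(N^4) work in general.
    J = np.einsum('ij,jab->iab', W, J)
    # Direct dependence of h_t on W through h_{t-1}:
    # d(W h_{t-1})[i] / dW[a, b] also gets delta(i, a) * h_{t-1}[b].
    for a in range(N):
        J[a, a, :] += h
    h = W @ h + x

# An online update for an error e at the current step reads the
# gradient straight out of the stored influence tensor.
e = rng.normal(size=N)
dW = np.einsum('i,iab->ab', e, J)  # dL/dW, no backward pass needed
```

No history of activations is kept; the price is the ever-present $N^3$ tensor, which is what UORO's unbiased approximation is trying to avoid.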
 Can be addressed with Unbiased Online Recurrent Optimization (UORO), which stores approximate but unbiased gradient estimates to reduce computation / storage.
 Attention seems like a much better way of approaching the TCA problem: past events are stored externally, and the network learns a differentiable attention/alignment module for selecting these events.
 Memory can be finite-size, extending, or self-compressing.
 Highlight the utility/necessity of content-addressable memory.
 Attentional gating can eliminate the exploding / vanishing / corrupting gradient problems; the gradient paths are skip connections.
 Biologically plausible: partial reactivation of CA3 memories induces reactivation of the neocortical neurons responsible for the initial encoding (PMID15685217, The organization of recent and remote memories, 2005).
 I remain reserved about the utility of thinking in terms of gradients when describing how the brain learns. Correlations, yes; causation, absolutely; credit assignment, for sure. Yet propagating gradients as a means for changing network weights seems at best a part of the puzzle. So much of behavior and internal cognitive life involves explicit, conscious computation of cause and credit.
 This leaves me much more sanguine about the use of external memory to guide behavior ... but differentiable attention? Hmm.
