You are not authenticated, login.
text: sort by
tags: modified
type: chronology
[0] Rózsa B, Katona G, Kaszás A, Szipöcs R, Vizi ES, Dendritic nicotinic receptors modulate backpropagating action potentials and long-term plasticity of interneurons.Eur J Neurosci 27:2, 364-77 (2008 Jan)

hide / / print
ref: -2019 tags: HSIC information bottleneck deep learning backprop gaussian kernel date: 10-06-2021 17:23 gmt revision:5 [4] [3] [2] [1] [0] [head]

The HSIC Bottleneck: Deep learning without Back-propagation

In this work, the authors use a kernelized estimate of statistical independence as part of a 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-schmidt independence criterion, as the independence measure.

The information bottleneck was proposed by Bailek (spikes..) et al in 1999, and aims to increase the mutual information between the layer representation and the labels while minimizing the mutual information between the representation and the input:

minP T i|XI(X;T i)βI(T i;Y)\frac{min}{P_{T_i | X}} I(X; T_i) - \beta I(T_i; Y)

Where T iT_i is the hidden representation at layer i (later output), XX is the layer input, and YY are the labels. By replacing I()I() with the HSIC, and some derivation (?), they show that

HSIC(D)=(m1) 2tr(K XHK YH)HSIC(D) = (m-1)^{-2} tr(K_X H K_Y H)

Where D=(x 1,y 1),...(x m,y m)D = {(x_1,y_1), ... (x_m, y_m)} are samples and labels, K X ij=k(x i,x j)K_{X_{ij}} = k(x_i, x_j) and K Y ij=k(y i,y j)K_{Y_{ij}} = k(y_i, y_j) -- that is, it's the kernel function applied to all pairs of (vectoral) input variables. H is the centering matrix. The kernel is simply a Gaussian kernel, k(x,y)=exp(1/2||xy|| 2/σ 2)k(x,y) = exp(-1/2 ||x-y||^2/\sigma^2) . So, if all the x and y are on average independent, then the inner-product will be mean zero, the kernel will be mean one, and after centering will lead to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and negative, i think). In practice they use three different widths for their kernel, and they also center the kernel matrices.

But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. it's not unimaginable that feedback networks could be doing something like this...

For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks albeit in a much less intelligible way.

Robust Learning with the Hilbert-Schmidt Independence Criterion

Is another, later, paper using the HSIC. Their interpretation: "This loss-function encourages learning models where the distribution of the residuals between the label and the model prediction is statistically independent of the distribution of the instances themselves." Hence, given above nomenclature, E X(P T i|XI(X;T i))=0 E_X( P_{T_i | X} I(X ; T_i) ) = 0 (I'm not totally sure about the weighting, but might be required given the definition of the HSIC.)

As I understand it, the HSIC loss is a kernellized loss between the input, output, and labels that encourages a degree of invariance to input ('covariate shift'). This is useful, but I'm unconvinced that making the layer output independent of the input is absolutely essential (??)

hide / / print
ref: -2019 tags: backprop neural networks deep learning coordinate descent alternating minimization date: 07-21-2021 03:07 gmt revision:1 [0] [head]

Beyond Backprop: Online Alternating Minimization with Auxiliary Variables

  • This paper is sort-of interesting: rather than back-propagating the errors, you optimize auxiliary variables, pre-nonlinearity 'codes' in a last-to-first layer order. The optimization is done to minimize a multimodal logistic loss function; math is not done to minimize other loss functions, but presumably this is not a limit. The loss function also includes a quadratic term on the weights.
  • After the 'codes' are set, optimization can proceed in parallel on the weights. This is done with either straight SGD or adaptive ADAM.
  • Weight L2 penalty is scheduled over time.

This is interesting in that the weight updates can be cone in parallel - perhaps more efficient - but you are still propagating errors backward, albeit via optimizing 'codes'. Given the vast infractructure devoted to auto-diff + backprop, I can't see this being adopted broadly.

That said, the idea of alternating minimization (which is used eg for EM clustering) is powerful, and this paper does describe (though I didn't read it) how there are guarantees on the convexity of the alternating minimization. Likewise, the authors show how to improve the performance of the online / minibatch algorithm by keeping around memory variables, in the form of covariance matrices.

hide / / print
ref: -0 tags: credit assignment distributed feedback alignment penn state MNIST fashion backprop date: 03-16-2019 02:21 gmt revision:1 [0] [head]

Conducting credit assignment by aligning local distributed representations

  • Alexander G. Ororbia, Ankur Mali, Daniel Kifer, C. Lee Giles
  • Propose two related algorithms: Local Representation Alignment (LRA)-diff and LRA-fdbk.
    • LRA-diff is basically a modified form of backprop.
    • LRA-fdbk is a modified version of feedback alignment. {1432} {1423}
  • Test on MNIST (easy -- many digits can be discriminated with one pixel!) and fashion-MNIST (harder -- humans only get about 85% right!)
  • Use a Cauchy or log-penalty loss at each layer, which is somewhat unique and interesting: L(z,y)= i=1 nlog(1+(y iz i) 2)L(z,y) = \sum_{i=1}^n{ log(1 + (y_i - z_i)^2)} .
    • This is hence a saturating loss.
  1. Normal multi-layer-perceptron feedforward network. pre activation h h^\ell and post activation z z^\ell are stored.
  2. Update the weights to minimize loss. This gradient calculation is identical to backprop, only they constrain the update to have a norm no bigger than c 1c_1 . Z and Y are actual and desired output of the layer, as commented. Gradient includes the derivative of the nonlinear activation function.
  3. Generaete update for the pre-nonlinearity h 1h^{\ell-1} to minimize the loss in the layer above. This again is very similar to backprop; its' the chain rule -- but the derivatives are vectors, of course, so those should be element-wise multiplication, not outer produts (i think).
    1. Note hh is updated -- derivatives of two nonlinearities.
  4. Feedback-alignment version, with random matrix E E_{\ell} (elements drawn from a gaussian distribution, σ=1\sigma = 1 ish.
    1. Only one nonlinearity derivative here -- bug?
  5. Move the rep and post activations in the specified gradient direction.
    1. Those h¯ 1\bar{h}^{\ell-1} variables are temporary holding -- but note that both lower and higher layers are updated.
  6. Do this K of times, K=1-50.
  • In practice K=1, with the LRA-fdbk algorithm, for the majority of the paper -- it works much better than LRA-diff (interesting .. bug?). Hence, this basically reduces to feedback alignment.
  • Demonstrate that LRA works much better with small initial weights, but basically because they tweak the algorithm to do this.
    • Need to see a positive control for this to be conclusive.
    • Again, why is FA so different from LRA-fdbk? Suspicious. Positive controls.
  • Attempted a network with Local Winner Take All (LWTA), which is a hard nonlinearity that LFA was able to account for & train through.
  • Also used Bernoulli neurons, and were able to successfully train. Unlike drop-out, these were stochastic at test time, and things still worked OK.

Lit review.
  • Logistic sigmoid can slow down learning, due to it's non-zero mean (Glorot & Bengio 2010).
  • Recirculation algorithm (or generalized recirculation) is a precursor for target propagation.
  • Target propagation is all about the inverse of the forward propagation: if we had access to the inverse of the network of forward propagations, we could compute which input values at the lower levels of the network would result in better values at the top that would please the global cost.
    • This is a very different way of looking at it -- almost backwards!
    • And indeed, it's not really all that different from contrastive divergence. (even though CD doesn't work well with non-Bernoulli units)
  • Contractive Hebbian learning also has two phases, one to fantasize, and done to try to make the fantasies look more like the input data.
  • Decoupled neural interfaces (Jaderberg et al 2016): learn a predictive model of error gradients (and inputs) nistead of trying to use local information to estimate updated weights.

  • Yeah, call me a critic, but I'm not clear on the contribution of this paper; it smells precocious and over-sold.
    • Even the title. I was hoping for something more 'local' than per-layer computation. BP does that already!
  • They primarily report supportive tests, not discriminative or stressing tests; how does the algorithm fail?
    • Certainly a lot of work went into it..
  • I still don't see how the computation of a target through a ransom matrix, then using delta/loss/error between that target and the feedforward activation to update weights, is much different than propagating the errors directly through a random feedback matrix. Eg. subtract then multiply, or multiply then subtract?

hide / / print
ref: -2019 tags: lillicrap google brain backpropagation through time temporal credit assignment date: 03-14-2019 20:24 gmt revision:2 [1] [0] [head]

PMID-22325196 Backpropagation through time and the brain

  • Timothy Lillicrap and Adam Santoro
  • Backpropagation through time: the 'canonical' expansion of backprop to assign credit in recurrent neural networks used in machine learning.
    • E.g. variable rol-outs, where the error is propagated many times through the recurrent weight matrix, W TW^T .
    • This leads to the exploding or vanishing gradient problem.
  • TCA = temporal credit assignment. What lead to this reward or error? How to affect memory to encourage or avoid this?
  • One approach is to simply truncate the error: truncated backpropagation through time (TBPTT). But this of course limits the horizon of learning.
  • The brain may do BPTT via replay in both the hippocampus and cortex Nat. Neuroscience 2007, thereby alleviating the need to retain long time histories of neuron activations (needed for derivative and credit assignment).
  • Less known method of TCA uses RTRL Real-time recurrent learning forward mode differentiation -- δh t/δθ\delta h_t / \delta \theta is computed and maintained online, often with synaptic weight updates being applied at each time step in which there is non-zero error. See A learning algorithm for continually running fully recurrent neural networks.
    • Big problem: A network with NN recurrent units requires O(N 3)O(N^3) storage and O(N 4)O(N^4) computation at each time-step.
    • Can be solved with Unbiased Online Recurrent optimization, which stores approximate but unbiased gradient estimates to reduce comp / storage.
  • Attention seems like a much better way of approaching the TCA problem: past events are stored externally, and the network learns a differentiable attention-alignment module for selecting these events.
    • Memory can be finite size, extending, or self-compressing.
    • Highlight the utility/necessity of content-addressable memory.
    • Attentional gating can eliminate the exploding / vanishing / corrupting gradient problems -- the gradient paths are skip-connections.
  • Biologically plausible: partial reactivation of CA3 memories induces re-activation of neocortical neurons responsible for initial encoding PMID-15685217 The organization of recent and remote memories. 2005

  • I remain reserved about the utility of thinking in terms of gradients when describing how the brain learns. Correlations, yes; causation, absolutely; credit assignment, for sure. Yet propagating gradients as a means for changing netwrok weights seems at best a part of the puzzle. So much of behavior and internal cognitive life involves explicit, conscious computation of cause and credit.
  • This leaves me much more sanguine about the use of external memory to guide behavior ... but differentiable attention? Hmm.

hide / / print
ref: -2019 tags: Arild Nokland local error signals backprop neural networks mnist cifar VGG date: 02-15-2019 03:15 gmt revision:6 [5] [4] [3] [2] [1] [0] [head]

Training neural networks with local error signals

  • Arild Nokland and Lars H Eidnes
  • Idea is to use one+ supplementary neural networks to measure within-batch matching loss between transformed hidden-layer output and one-hot label data to produce layer-local learning signals (gradients) for improving local representation.
  • Hence, no backprop. Error signals are all local, and inter-layer dependencies are not explicitly accounted for (! I think).
  • L simL_{sim} : given a mini-batch of hidden layer activations H=(h 1,...,h n)H = (h_1, ..., h_n) and a one-hot encoded label matrix Y=(y 1,...,y nY = (y_1, ..., y_n ,
    • L sim=||S(NeuralNet(H))S(Y)|| F 2 L_{sim} = || S(NeuralNet(H)) - S(Y)||^2_F (don't know what F is..)
    • NeuralNet()NeuralNet() is a convolutional neural net (trained how?) 3*3, stride 1, reduces output to 2.
    • S()S() is the cosine similarity matrix, or correlation matrix, of a mini-batch.
  • L pred=CrossEntropy(Y,W TH)L_{pred} = CrossEntropy(Y, W^T H) where W is a weight matrix, dim hidden_size * n_classes.
    • Cross-entropy is H(Y,W TH)=Σ i,jY i,jlog((W TH) i,j)+(1Y i,j)log(1(W TH) i,j) H(Y, W^T H) = \Sigma_{i,j} Y_{i,j} log((W^T H)_{i,j}) + (1-Y_{i,j}) log(1-(W^T H)_{i,j})
  • Sim-bio loss: replace NeuralNet()NeuralNet() with average-pooling and standard-deviation op. Plus one-hot target is replaced with a random transformation of the same target vector.
  • Overall loss 99% L simL_sim , 1% L predL_pred
    • Despite the unequal weighting, both seem to improve test prediction on all examples.
  • VGG like network, with dropout and cutout (blacking out square regions of input space), batch size 128.
  • Tested on all the relevant datasets: MNIST, Fashion-MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, STL-10, SVHN.
  • Pretty decent review of similarity matching measures at the beginning of the paper; not extensive but puts everything in context.
    • See for example non-negative matrix factorization using Hebbian and anti-Hebbian learning in and Chklovskii 2014.
  • Emphasis put on biologically realistic learning, including the use of feedback alignment {1423}
    • Yet: this was entirely supervised learning, as the labels were propagated back to each layer.
    • More likely that biology is setup to maximize available labels (not a new concept).

hide / / print
ref: -2014 tags: Lillicrap Random feedback alignment weights synaptic learning backprop MNIST date: 02-14-2019 01:02 gmt revision:5 [4] [3] [2] [1] [0] [head]

PMID-27824044 Random synaptic feedback weights support error backpropagation for deep learning.

  • "Here we present a surprisingly simple algorithm for deep learning, which assigns blame by multiplying error signals by a random synaptic weights.
  • Backprop multiplies error signals e by the weight matrix W T W^T , the transpose of the forward synaptic weights.
  • But the feedback weights do not need to be exactly W T W^T ; any matrix B will suffice, so long as on average:
  • e TWBe>0 e^T W B e > 0
    • Meaning that the teaching signal Be B e lies within 90deg of the signal used by backprop, W Te W^T e
  • Feedback alignment actually seems to work better than backprop in some cases. This relies on starting the weights very small (can't be zero -- no output)

Our proof says that weights W0 and W
evolve to equilibrium manifolds, but simulations (Fig. 4) and analytic results (Supple-
mentary Proof 2) hint at something more specific: that when the weights begin near
0, feedback alignment encourages W to act like a local pseudoinverse of B around
the error manifold. This fact is important because if B were exactly W + (the Moore-
Penrose pseudoinverse of W ), then the network would be performing Gauss-Newton
optimization (Supplementary Proof 3). We call this update rule for the hidden units
pseudobackprop and denote it by ∆hPBP = W + e. Experiments with the linear net-
work show that the angle, ∆hFA ]∆hPBP quickly becomes smaller than ∆hFA ]∆hBP
(Fig. 4b, c; see Methods). In other words feedback alignment, despite its simplicity,
displays elements of second-order learning.

hide / / print
ref: -0 tags: lillicrap segregated dendrites deep learning backprop date: 01-31-2019 19:24 gmt revision:2 [1] [0] [head]

PMID-29205151 Towards deep learning with segregated dendrites https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5716677/

  • Much emphasis on the problem of credit assignment in biological neural networks.
    • That is: given complex behavior, how do upstream neurons change to improve the task of downstream neurons?
    • Or: given downstream neurons, how do upstream neurons receive ‘credit’ for informing behavior?
      • I find this a very limiting framework, and is one of my chief beefs with the work.
      • Spatiotemporal Bayesian structure seems like a much better axis (axes) to cast function against.
      • Or, it could be segregation into ‘signal’ and ‘error’ or ‘figure/ground’ based on hierarchical spatio-temporal statistical properties that matters ...
      • ... with proper integration of non-stochastic spike timing + neoSTDP.
        • This still requires some solution of the credit-assignment problem, i know i know.
  • Outline a spiking neuron model with zero one or two hidden layers, and a segregated apical (feedback) and basal (feedforward) dendrites, as per a layer 5 pyramidal neuron.
  • The apical dendrites have plateau potentials, which are stimulated through (random) feedback weights from the output neurons.
  • Output neurons are forced to one-hot activation at maximum firing rate during training.
    • In order to assign credit, feedforward information must be integrated separately from any feedback signals used to calculate error for synaptic updates (the error is indicated here with δ). (B) Illustration of the segregated dendrites proposal. Rather than using a separate pathway to calculate error based on feedback, segregated dendritic compartments could receive feedback and calculate the error signals locally.
  • Uses the MNIST database, naturally.
  • Poisson spiking input neurons, 784, again natch.
  • Derive local loss function learning rules to make the plateau potential (from the feedback weights) match the feedforward potential
    • This encourages the hidden layer -> output layer to approximate the inverse of the random feedback weight network -- which it does! (At least, the jacobians are inverses of each other).
    • The matching is performed in two phases -- feedforward and feedback. This itself is not biologically implausible, just unlikely.
  • Achieved moderate performance on MNIST, ~ 4%, which improved with 2 hidden layers.
  • Very good, interesting scholarship on the relevant latest findings ‘’in vivo’’.
  • While the model seems workable though ad-hoc or just-so, the scholarship points to something better: use of multiple neuron subtypes to accomplish different elements (variables) in the random-feedback credit assignment algorithm.
    • These small models can be tuned to do this somewhat simple task through enough fiddling & manual (e.g. in the algorithmic space, not weight space) backpropagation of errors.
  • They suggest that the early phases of learning may entail learning the feedback weights -- fascinating.
  • ‘’Things are definitely moving forward’’.

hide / / print
ref: Harris-2008.03 tags: retroaxonal retrosynaptic Harris learning cortex backprop date: 12-07-2011 02:34 gmt revision:2 [1] [0] [head]

PMID-18255165[0] Stability of the fittest: organizing learning through retroaxonal signals

  • the central hypothesis: strengthening of a neuron's output synapses stabilizes recent changes in the same neuron's inputs.
    • this causes representations (as are arrived at with backprop) that are tuned to task features.
  • Retroaxonal signaling in the brain is too slow for an instructive (says at least the sign of the error wrt a current neuron's output) backprop algorithm
  • hence, retroaxonal signals are not instructive but selective.
  • At SFN Harris was looking for people to test this in a model; as (yet) unmodeled and untested, I'm suspicious of it.
  • Seems plausible, yet it also just seems to be a way of moving the responsibility for learning computation to the postsynaptic neuron (which is then propagated back to the present neuron). The theory does not immediately suggest what neurons are doing to learn their stuff; rather how they may be learning.
    • If this stabilization is based on some sort of feedback (attention? reward?), which may guide learning (except for the cortex, which does not have many (any?) DA receptors...), then I may be more willing to accept it.
    • It seems likely that the cortex is doing a lot of unsupervised learning: predicting what sensory info will come next based on present sensory info (ICA, PCA).


[0] Harris KD, Stability of the fittest: organizing learning through retroaxonal signals.Trends Neurosci 31:3, 130-6 (2008 Mar)

hide / / print
ref: -0 tags: backpropagation cascade correlation neural networks date: 12-20-2010 06:28 gmt revision:1 [0] [head]

The Cascade-Correlation Learning Architecture

  • Much better - much more sensible, computationally cheaper, than backprop.
  • Units are added one by one; each is trained to be maximally correlated to the error of the existing, frozen neural network.
  • Uses quickprop to speed up gradient ascent learning.

hide / / print
ref: RAzsa-2008.01 tags: nAChR nicotinic acetylchoine receptor interneurons backpropagating LTP hippocampus date: 10-08-2008 17:37 gmt revision:0 [head]

PMID-18215234[0] Dendritic nicotinic receptors modulate backpropagating action potentials and long-term plasticity of interneurons.

  • idea: nAChRs are highly permeable to Ca+2, LTP is dependent on Ca2+, so they tested nAChR -> LTP in interneurons of rat hippocampus using whole-cell electrophysiology and 2-photon imaging.
  • Here we show that precisely timed activation of dendritic α7-nAChRs boosts the induction of LTP by excitatory postsynaptic potentials (EPSPs) and synaptically triggered dendritic Ca2+ transients.
  • suggest that this rapid (ionotropic) method of memory encoding and retrieval via LTP/D facilitated by acetylcholine.
  • I haven't read much of the article, since it is much out of my field of experience.