You are not authenticated, login.
text: sort by
tags: modified
type: chronology
hide / / print
ref: -2020 tags: feedback alignment local hebbian learning rules stanford date: 04-22-2021 03:26 gmt revision:0 [head]

Two Routes to Scalable Credit Assignment without Weight Symmetry

This paper looks at five different learning rules, three purely local, and two non-local, to see if they can work as well as backprop in training a deep convolutional net on ImageNet. The local learning networks all feature forward weights W and backward weights B; the forward weights (+ nonlinearities) pass the information to lead to a classification; the backward weights pass the error, which is used to locally adjust the forward weights.

Hence, each fake neuron has locally the forward activation, the backward error (or loss gradient), the forward weight, backward weight, and Hebbian terms thereof (e.g the outer product of the in-out vectors for both forward and backward passes). From these available variables, they construct the local learning rules:

  • Decay (exponentially decay the backward weights)
  • Amp (Hebbian learning)
  • Null (decay based on the product of the weight and local activation. This effects a Euclidean norm on reconstruction.

Each of these serves as a "regularizer term" on the feedback weights, which governs their learning dynamics. In the case of backprop, the backward weights B are just the instantaneous transpose of the forward weights W. A good local learning rule approximates this transpose progressively. They show that, with proper hyperparameter setting, this does indeed work nearly as well as backprop when training a ResNet-18 network.

But, hyperparameter settings don't translate to other network topologies. To allow this, they add in non-local learning rules:

  • Sparse (penalizes the Euclidean norm of the previous layer; gradient is the outer product of the (current layer activation &transpose) * B)
  • Self (directly measures the forward weights and uses them to update the backward weights)

In "Symmetric Alignment", the Self and Decay rules are employed. This is similar to backprop (the backward weights will track the forward ones) with L2 regularization, which is not new. It performs very similarly to backprop. In "Activation Alignment", Amp and Sparse rules are employed. I assume this is supposed to be more biologically plausible -- the Hebbian term can track the forward weights, while the Sparse rule regularizes and stabilizes the learning, such that overall dynamics allow the gradient to flow even if W and B aren't transposes of each other.

Surprisingly, they find that Symmetric Alignment to be more robust to the injection of Gaussian noise during training than backprop. Both SA and AA achieve similar accuracies on the ResNet benchmark. The authors then go on to explain the plausibility of non-local but approximate learning rules with Regression discontinuity design ala Spiking allows neurons to estimate their causal effect.

This is a decent paper,reasonably well written. They thought trough what variables are available to affect learning, and parameterized five combinations that work. Could they have done the full matrix of combinations, optimizing just they same as the metaparameters? Perhaps, but that would be even more work ...

Regarding the desire to reconcile backprop and biology, this paper does not bring us much (if at all) closer. Biological neural networks have specific and local uses for error; even invoking 'error' has limited explanatory power on activity. Learning and firing dynamics, of course of course. Is the brain then just an overbearing mess of details and overlapping rules? Yes probably but that doesn't mean that we human's can't find something simpler that works. The algorithms in this paper, for example, are well described by a bit of linear algebra, and yet they are performant.

hide / / print
ref: -2013 tags: synaptic learning rules calcium harris stdp date: 02-18-2021 19:48 gmt revision:3 [2] [1] [0] [head]

PMID-24204224 The Convallis rule for unsupervised learning in cortical networks 2013 - Pierre Yger  1 , Kenneth D Harris

This paper aims to unify and reconcile experimental evidence of in-vivo learning rules with  established STDP rules.  In particular, the STDP rule fails to accurately predict change in strength in response to spike triplets, e.g. pre-post-pre or post-pre-post.  Their model instead involves the competition between two time-constant threshold circuits / coincidence detectors, one which controls LTD and another LTP, and is such an extension of the classical BCM rule.  (BCM: inputs below a threshold will weaken a synapse; those above it will strengthen. )

They derive the model from optimization criteria that neurons should try to optimize the skewedness of the distribution of their membrane potential: much time spent either firing spikes or strongly inhibited.  This maps to a objective function F that looks like a valley - hence the 'convallis' in the name (latin for valley); the objective is differentiated to yield a weighting function for weight changes; they also add a shrinkage function (line + heaviside function) to gate weight changes 'off' at resting membrane potential. 

A network of firing neurons successfully groups correlated rate-encoded inputs, better than the STDP rule.  it can also cluster auditory inputs of spoken digits converted into cochleogram.  But this all seems relatively toy-like: of course algorithms can associate inputs that co-occur.  The same result was found for a recurrent balanced E-I network with the same cochleogram, and convalis performed better than STDP.   Meh.

Perhaps the biggest thing I got from the paper was how poorly STDP fares with spike triplets:

Pre following post does not 'necessarily' cause LTD; it's more complicated than that, and more consistent with the two different-timeconstant coincidence detectors.  This is satisfying as it allows for apical dendritic depolarization to serve as a contextual binding signal - without negatively impacting the associated synaptic weights. 

hide / / print
ref: -0 tags: multifactor synaptic learning rules date: 01-22-2020 01:45 gmt revision:9 [8] [7] [6] [5] [4] [3] [head]

Why multifactor?

  • Take a simple MLP. Let xx be the layer activation. X 0X^0 is the input, X 1X^1 is the second layer (first hidden layer). These are vectors, indexed like x i ax^a_i .
  • Then X 1=WX 0X^1 = W X^0 or x j 1=ϕ(Σ i=1 Nw ijx i 0)x^1_j = \phi(\Sigma_{i=1}^N w_{ij} x^0_i) . ϕ\phi is the nonlinear activation function (ReLU, sigmoid, etc.)
  • In standard STDP the learning rule follows Δwf(x pre(t),x post(t)) \Delta w \propto f(x_{pre}(t), x_{post}(t)) or if layer number is aa Δw a+1f(x a(t),x a+1(t))\Delta w^{a+1} \propto f(x^a(t), x^{a+1}(t))
    • (but of course nobody thinks there 'numbers' on the 'layers' of the brain -- this is just referring to pre and post synaptic).
  • In an artificial neural network, Δw aEw ij aδ j ax i \Delta w^a \propto - \frac{\partial E}{\partial w_{ij}^a} \propto - \delta_{j}^a x_{i} (Intuitively: the weight change is proportional to the error propagated from higher layers times the input activity) where δ j a=(Σ k=1 Nw jkδ k a+1)ϕ \delta_{j}^a = (\Sigma_{k=1}^{N} w_{jk} \delta_k^{a+1}) \partial \phi where ϕ\partial \phi is the derivative of the nonlinear activation function, evaluated at a given activation.
  • f(i,j)[x,y,θ,ϕ] f(i, j) \rightarrow [x, y, \theta, \phi]
  • k=13.165 k = 13.165
  • x=round(i/k) x = round(i / k)
  • y=round(j/k) y = round(j / k)
  • θ=a(ikx)+b(ikx) 2 \theta = a (\frac{i}{k} - x) + b (\frac{i}{k} - x)^2
  • ϕ=a(jky)+b(jky) 2 \phi = a (\frac{j}{k} - y) + b (\frac{j}{k} - y)^2

hide / / print
ref: -0 tags: nonlinear hebbian synaptic learning rules projection pursuit date: 12-12-2019 00:21 gmt revision:4 [3] [2] [1] [0] [head]

PMID-27690349 Nonlinear Hebbian Learning as a Unifying Principle in Receptive Field Formation

  • Here we show that the principle of nonlinear Hebbian learning is sufficient for receptive field development under rather general conditions.
  • The nonlinearity is defined by the neuron’s f-I curve combined with the nonlinearity of the plasticity function. The outcome of such nonlinear learning is equivalent to projection pursuit [18, 19, 20], which focuses on features with non-trivial statistical structure, and therefore links receptive field development to optimality principles.
  • Δwxh(g(w Tx))\Delta w \propto x h(g(w^T x)) where h is the hebbian plasticity term, and g is the neurons f-I curve (input-output relation), and x is the (sensory) input.
  • The relevant property of natural image statistics is that the distribution of features derived from typical localized oriented patterns has high kurtosis [5,6, 39]
  • Model is a generalized leaky integrate and fire neuron, with triplet STDP