m8ta
{1552}
ref: -2020 tags: Principe modular deep learning kernel trick MNIST CIFAR date: 10-06-2021 16:54 gmt revision:2 [1] [0] [head]

Modularizing Deep Learning via Pairwise Learning With Kernels

  • Shiyu Duan, Shujian Yu, Jose Principe
  • The central idea here is to re-interpret deep networks, not with the nonlinearity as the output of a layer, but rather as the input of the layer, with the regression (weights) being performed on this nonlinear projection.
  • In this sense, each re-defined layer is implementing the 'kernel trick': tasks (like classification) which are difficult in linear spaces, become easier when projected into some sort of kernel space.
    • The kernel allows pairwise comparisons of datapoints. E.g. a radial basis kernel measures the radial / Gaussian distance between data points. An SVM is a kernel machine in this sense.
      • A natural extension (one that the authors have considered) is to take non-pointwise or non-one-to-one kernel functions -- those that e.g. multiply multiple layer outputs. This is of course part of standard kernel machines.
  • Because you are comparing projected datapoints, it's natural to take contrastive loss on each layer to tune the weights to maximize the distance / discrimination between different classes.
    • Hence this is semi-supervised contrastive classification, something that is quite popular these days.
    • The last layer is tuned with cross-entropy against labels, but only a few labels are required since the data is well distributed already.
  • Demonstrated on small-ish datasets, concordant with their computational resources ...
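
To make the 'pairwise comparison' idea concrete, here is a minimal NumPy sketch, assuming an RBF kernel and a toy contrastive objective. The function names and the exact form of the loss are illustrative assumptions, not the objective used in the paper.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Pairwise Gaussian kernel k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def pairwise_contrastive_loss(X, labels, gamma=1.0):
    """Toy objective (an assumption, not the paper's loss):
    mean kernel similarity of different-class pairs minus that of
    same-class pairs. Lower means classes are better separated."""
    K = rbf_kernel(X, gamma)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    return K[~same].mean() - K[same & off_diag].mean()
```

A layer trained on such a loss only needs its own outputs and the labels of the mini-batch, which is what makes the training modular.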

I think in general this is an important result, even if it's not wholly unique / somewhat anticipated (it's a year old at the time of writing). Modular training of neural networks is great for efficiency, parallelization, and biological implementations! Transport of weights between layers is hence non-essential.

Class labels still are essential, but I wonder if temporal continuity can solve some of these problems?

(There is plenty of other effort in this area -- see also {1544})

{1547}
ref: -2018 tags: luke metz meta learning google brain sgd model mnist Hebbian date: 08-05-2021 01:07 gmt revision:2 [1] [0] [head]

Meta-Learning Update Rules for Unsupervised Representation Learning

  • Central idea: meta-train a training-network (a MLP) which trains a task-network (also a MLP) to do unsupervised learning on one dataset.
  • The training network is optimized through SGD based on small-shot linear learning on a test set, typically different from the unsupervised training set.
  • The training-network is a per-weight MLP which takes in layer input, layer output, and a synthetic error (denoted $\eta$), and generates a and b, which are then fed into an outer-product Hebbian learning rule.
  • $\eta$ itself is formed through a backward pass through weights $V$, which affords something like backprop -- but not exactly backprop, of course. See the figure.
  • Training consists of building up very large, backward through time gradient estimates relative to the parameters of the training-network. (And there are a lot!)
  • Trained on CIFAR10, MNIST, FashionMNIST, IMDB sentiment prediction. All have their input permuted to keep the training-network from learning per-task weights. Instead the network should learn to interpret the statistics between datapoints.
  • Indeed, it does this -- albeit with limits. Performance is OK, but only if you only do supervised learning on the very limited dataset used in the meta-optimization.
    • In practice, it's possible to completely solve tasks like MNIST with supervised learning; this gets to about 80% accuracy.
  • Images were kept small -- about 20x20 -- to speed up the inner loop unsupervised learning. Still, this took on the order of 200 hours across ~500 TPUs.
  • See, as a comparison, Keren's paper, Meta-learning biologically plausible semi-supervised update rules. It's conceptually nice but only evaluates the two-moons and two-gaussian datasets.
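
A rough sketch of the outer-product Hebbian step described above, assuming a tiny per-neuron MLP that maps (activation, synthetic error) to a scalar code. The names `tiny_mlp`, `neuron_code` and `hebbian_delta` are hypothetical, the MLP weights stand in for the meta-learned ones, and the feature set is a simplification of what the paper feeds in.

```python
import numpy as np

def tiny_mlp(features, W1, W2):
    """Stand-in for the meta-learned update network (weights are placeholders)."""
    return np.tanh(features @ W1) @ W2

def neuron_code(activation, error, W1, W2):
    """Map each neuron's (activation, synthetic error) pair to one scalar.
    activation, error: (batch, n_units) -> returns (batch, n_units)."""
    feats = np.stack([activation, error], axis=-1)  # (batch, n_units, 2)
    return tiny_mlp(feats, W1, W2)[..., 0]

def hebbian_delta(pre, post, err_pre, err_post, W1, W2):
    """Outer-product rule: dW[i, j] = batch mean of a[:, i] * b[:, j]."""
    a = neuron_code(pre, err_pre, W1, W2)    # presynaptic codes
    b = neuron_code(post, err_post, W1, W2)  # postsynaptic codes
    return a.T @ b / pre.shape[0]
```

The same small MLP is shared across all units, so the number of meta-parameters does not grow with the size of the task-network.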

This is a clearly-written, easy to understand paper. The results are not highly compelling, but as a first set of experiments, it's successful enough.

I wonder what more constraints (fewer parameters, per the genome), more options for architecture modifications (e.g. different feedback schemes, per neurobiology), and a black-box optimization algorithm (evolution) would do?

{1528}
ref: -2015 tags: olshausen redwood autoencoder VAE MNIST faces variation date: 11-27-2020 03:04 gmt revision:0 [head]

Discovering hidden factors of variation in deep networks

  • Well, they are not really that deep ...
  • Use a VAE to encode both a supervised signal (class labels) as well as unsupervised latents.
  • Penalize a combination of the MSE of reconstruction, logits of the classification error, and a special cross-covariance term to decorrelate the supervised and unsupervised latent vectors.
  • Cross-covariance penalty:
  • Tested on
    • MNIST -- discovered style / rotation of the characters
    • Toronto faces database -- seven expressions, many individuals; extracted eigen-emotions sorta.
    • Multi-PIE -- many faces, many viewpoints; was able to vary camera pose and illumination with the unsupervised latents.
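
A minimal sketch of a cross-covariance penalty between the supervised outputs and the unsupervised latents, assuming the usual form (half the sum of squared entries of the batch cross-covariance matrix). The 1/2 factor and the 1/N normalization are assumptions and have not been checked against the paper.

```python
import numpy as np

def cross_covariance_penalty(y, z):
    """y: (batch, n_classes) supervised outputs, z: (batch, n_latent) latents.
    Returns 0.5 * sum_ij C_ij^2, with C the batch cross-covariance of y and z."""
    y_c = y - y.mean(axis=0, keepdims=True)
    z_c = z - z.mean(axis=0, keepdims=True)
    C = y_c.T @ z_c / y.shape[0]
    return 0.5 * np.sum(C ** 2)
```

The penalty is zero when every latent is uncorrelated with every class output over the batch, which pushes class information out of the unsupervised latents.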

{1455}
ref: -0 tags: credit assignment distributed feedback alignment penn state MNIST fashion backprop date: 03-16-2019 02:21 gmt revision:1 [0] [head]

Conducting credit assignment by aligning local distributed representations

  • Alexander G. Ororbia, Ankur Mali, Daniel Kifer, C. Lee Giles
  • Propose two related algorithms: Local Representation Alignment (LRA)-diff and LRA-fdbk.
    • LRA-diff is basically a modified form of backprop.
    • LRA-fdbk is a modified version of feedback alignment. {1432} {1423}
  • Test on MNIST (easy -- many digits can be discriminated with one pixel!) and fashion-MNIST (harder -- humans only get about 85% right!)
  • Use a Cauchy or log-penalty loss at each layer, which is somewhat unique and interesting: $L(z,y) = \sum_{i=1}^n \log(1 + (y_i - z_i)^2)$.
    • This is hence a saturating loss.
  1. Normal multi-layer-perceptron feedforward network. Pre-activation $h^\ell$ and post-activation $z^\ell$ are stored.
  2. Update the weights to minimize loss. This gradient calculation is identical to backprop, only they constrain the update to have a norm no bigger than $c_1$. Z and Y are actual and desired output of the layer, as commented. Gradient includes the derivative of the nonlinear activation function.
  3. Generate update for the pre-nonlinearity $h^{\ell-1}$ to minimize the loss in the layer above. This again is very similar to backprop; it's the chain rule -- but the derivatives are vectors, of course, so those should be element-wise multiplication, not outer products (I think).
    1. Note $h$ is updated -- derivatives of two nonlinearities.
  4. Feedback-alignment version, with random matrix $E_{\ell}$ (elements drawn from a Gaussian distribution, $\sigma = 1$ ish).
    1. Only one nonlinearity derivative here -- bug?
  5. Move the pre and post activations in the specified gradient direction.
    1. Those $\bar{h}^{\ell-1}$ variables are temporary holding -- but note that both lower and higher layers are updated.
  6. Do this K times, K = 1-50.
  • In practice K=1, with the LRA-fdbk algorithm, for the majority of the paper -- it works much better than LRA-diff (interesting .. bug?). Hence, this basically reduces to feedback alignment.
  • Demonstrate that LRA works much better with small initial weights, but basically because they tweak the algorithm to do this.
    • Need to see a positive control for this to be conclusive.
    • Again, why is FA so different from LRA-fdbk? Suspicious. Positive controls.
  • Attempted a network with Local Winner Take All (LWTA), which is a hard nonlinearity that LRA was able to account for & train through.
  • Also used Bernoulli neurons, and were able to successfully train. Unlike drop-out, these were stochastic at test time, and things still worked OK.
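
A small sketch of the per-layer log-penalty loss, its gradient, and the norm constraint from step 2 above. The function names are made up for illustration; only the loss formula comes from the notes.

```python
import numpy as np

def log_penalty(z, y):
    """L(z, y) = sum_i log(1 + (y_i - z_i)^2)."""
    return np.sum(np.log1p((y - z) ** 2))

def log_penalty_grad(z, y):
    """dL/dz_i = -2 (y_i - z_i) / (1 + (y_i - z_i)^2).
    The magnitude peaks at |y_i - z_i| = 1 and decays for larger errors,
    which is the saturating behaviour."""
    d = y - z
    return -2.0 * d / (1.0 + d ** 2)

def clip_norm(g, c):
    """Rescale g so its Euclidean norm is at most c."""
    n = np.linalg.norm(g)
    return g if n <= c else g * (c / n)
```

Because the gradient shrinks for very large errors, outlier targets pull on the layer less than they would under squared error.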

Lit review.
  • Logistic sigmoid can slow down learning, due to its non-zero mean (Glorot & Bengio 2010).
  • Recirculation algorithm (or generalized recirculation) is a precursor for target propagation.
  • Target propagation is all about the inverse of the forward propagation: if we had access to the inverse of the network of forward propagations, we could compute which input values at the lower levels of the network would result in better values at the top that would please the global cost.
    • This is a very different way of looking at it -- almost backwards!
    • And indeed, it's not really all that different from contrastive divergence. (even though CD doesn't work well with non-Bernoulli units)
  • Contrastive Hebbian learning also has two phases, one to fantasize, and one to try to make the fantasies look more like the input data.
  • Decoupled neural interfaces (Jaderberg et al 2016): learn a predictive model of error gradients (and inputs) instead of trying to use local information to estimate updated weights.

  • Yeah, call me a critic, but I'm not clear on the contribution of this paper; it smells precocious and over-sold.
    • Even the title. I was hoping for something more 'local' than per-layer computation. BP does that already!
  • They primarily report supportive tests, not discriminative or stressing tests; how does the algorithm fail?
    • Certainly a lot of work went into it..
  • I still don't see how the computation of a target through a random matrix, then using delta/loss/error between that target and the feedforward activation to update weights, is much different than propagating the errors directly through a random feedback matrix. E.g. subtract then multiply, or multiply then subtract?

{1426}
ref: -2019 tags: Arild Nokland local error signals backprop neural networks mnist cifar VGG date: 02-15-2019 03:15 gmt revision:6 [5] [4] [3] [2] [1] [0] [head]

Training neural networks with local error signals

  • Arild Nokland and Lars H Eidnes
  • Idea is to use one+ supplementary neural networks to measure within-batch matching loss between transformed hidden-layer output and one-hot label data to produce layer-local learning signals (gradients) for improving local representation.
  • Hence, no backprop. Error signals are all local, and inter-layer dependencies are not explicitly accounted for (! I think).
  • $L_{sim}$: given a mini-batch of hidden layer activations $H = (h_1, ..., h_n)$ and a one-hot encoded label matrix $Y = (y_1, ..., y_n)$,
    • $L_{sim} = || S(NeuralNet(H)) - S(Y) ||^2_F$ (F is the Frobenius norm)
    • $NeuralNet()$ is a convolutional neural net (trained how?) 3*3, stride 1, reduces output to 2.
    • $S()$ is the cosine similarity matrix, or correlation matrix, of a mini-batch.
  • $L_{pred} = CrossEntropy(Y, W^T H)$ where W is a weight matrix, dim hidden_size * n_classes.
    • Cross-entropy is $H(Y, W^T H) = -\sum_{i,j} [ Y_{i,j} \log((W^T H)_{i,j}) + (1-Y_{i,j}) \log(1-(W^T H)_{i,j}) ]$
  • Sim-bio loss: replace $NeuralNet()$ with average-pooling and standard-deviation op. Plus one-hot target is replaced with a random transformation of the same target vector.
  • Overall loss 99% $L_{sim}$, 1% $L_{pred}$
    • Despite the unequal weighting, both seem to improve test prediction on all examples.
  • VGG like network, with dropout and cutout (blacking out square regions of input space), batch size 128.
  • Tested on all the relevant datasets: MNIST, Fashion-MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, STL-10, SVHN.
  • Pretty decent review of similarity matching measures at the beginning of the paper; not extensive but puts everything in context.
    • See for example non-negative matrix factorization using Hebbian and anti-Hebbian learning in and Chklovskii 2014.
  • Emphasis put on biologically realistic learning, including the use of feedback alignment {1423}
    • Yet: this was entirely supervised learning, as the labels were propagated back to each layer.
    • More likely that biology is set up to maximize available labels (not a new concept).
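
A minimal sketch of the similarity-matching loss, assuming plain cosine similarity between rows of the mini-batch and with the supplementary network replaced by the identity. Both simplifications are assumptions made to keep the example short, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X, eps=1e-8):
    """S[i, j] = cosine similarity between rows i and j of X."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    return Xn @ Xn.T

def sim_loss(H, Y):
    """Squared Frobenius norm of S(H) - S(Y).
    H: (batch, features) hidden activations, Y: (batch, n_classes) one-hot labels."""
    diff = cosine_similarity_matrix(H) - cosine_similarity_matrix(Y)
    return np.sum(diff ** 2)
```

For one-hot labels, S(Y) is 1 for same-class pairs and 0 otherwise, so the loss asks same-class activations to point the same way and different-class activations to be orthogonal.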

{1432}
ref: -0 tags: feedback alignment Arild Nokland MNIST CIFAR date: 02-14-2019 02:15 gmt revision:0 [head]

Direct Feedback alignment provides learning in deep neural nets

  • from {1423}
  • Feedback alignment is able to provide zero training error even in convolutional networks and very deep networks, completely without error back-propagation.
  • Biologically plausible: error signal is entirely local, no symmetric or reciprocal weights required.
    • Still, it requires supervision.
  • Almost as good as backprop!
  • Clearly written, easy to follow math.
    • Though the proof that feedback-alignment direction is within 90 deg of backprop is a bit impenetrable, needs some reorganization or additional exposition / annotation.
  • 3x400 tanh network tested on MNIST; performs similarly to backprop, if faster.
  • Also able to train very deep networks (100 layers) on MNIST, CIFAR-10, and CIFAR-100, though that depth actually hurts on this task.
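
A sketch of how direct feedback alignment could compute its weight updates for a small tanh network with a linear output and squared-error loss. The architecture, loss, and the names `forward` and `dfa_grads` are assumptions for illustration; the point is that each hidden layer receives the output error through its own fixed random matrix rather than through the transposed forward weights.

```python
import numpy as np

def forward(x, Ws):
    """tanh hidden layers, linear output. Returns (hidden activations, output)."""
    hs = []
    a = x
    for W in Ws[:-1]:
        a = np.tanh(a @ W)
        hs.append(a)
    return hs, a @ Ws[-1]

def dfa_grads(x, y, Ws, Bs):
    """Bs[l] has shape (n_out, n_hidden_l) and is fixed (never trained).
    Returns one update direction per weight matrix in Ws."""
    hs, out = forward(x, Ws)
    e = out - y                                   # output error, (batch, n_out)
    # Hidden deltas: error projected straight from the output, times tanh'.
    deltas = [(e @ B) * (1.0 - h ** 2) for h, B in zip(hs, Bs)] + [e]
    inputs = [x] + hs
    return [inp.T @ d / len(x) for inp, d in zip(inputs, deltas)]
```

With a single hidden layer and `Bs = [Ws[1].T]` this reduces to ordinary backprop, which gives a handy sanity check; swapping in random matrices for `Bs` gives the DFA update.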

{1423}
ref: -2014 tags: Lillicrap Random feedback alignment weights synaptic learning backprop MNIST date: 02-14-2019 01:02 gmt revision:5 [4] [3] [2] [1] [0] [head]

PMID-27824044 Random synaptic feedback weights support error backpropagation for deep learning.

  • "Here we present a surprisingly simple algorithm for deep learning, which assigns blame by multiplying error signals by random synaptic weights."
  • Backprop multiplies error signals e by the weight matrix $W^T$, the transpose of the forward synaptic weights.
  • But the feedback weights do not need to be exactly $W^T$; any matrix B will suffice, so long as on average:
  • $e^T W B e > 0$
    • Meaning that the teaching signal $B e$ lies within 90deg of the signal used by backprop, $W^T e$
  • Feedback alignment actually seems to work better than backprop in some cases. This relies on starting the weights very small (can't be zero -- no output)

Our proof says that weights $W_0$ and $W$ evolve to equilibrium manifolds, but simulations (Fig. 4) and analytic results (Supplementary Proof 2) hint at something more specific: that when the weights begin near 0, feedback alignment encourages $W$ to act like a local pseudoinverse of $B$ around the error manifold. This fact is important because if $B$ were exactly $W^+$ (the Moore-Penrose pseudoinverse of $W$), then the network would be performing Gauss-Newton optimization (Supplementary Proof 3). We call this update rule for the hidden units pseudobackprop and denote it by $\Delta h_{PBP} = W^+ e$. Experiments with the linear network show that the angle, $\Delta h_{FA} \angle \Delta h_{PBP}$, quickly becomes smaller than $\Delta h_{FA} \angle \Delta h_{BP}$ (Fig. 4b, c; see Methods). In other words feedback alignment, despite its simplicity, displays elements of second-order learning.
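
The alignment condition and the pseudoinverse remark can be checked numerically. This is a small sketch assuming W has shape (n_out, n_hidden) and B has shape (n_hidden, n_out); the helper names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def alignment(W, B, e):
    """Return (e^T W B e, cosine between the FA signal B e and the BP signal W^T e).
    The two always share a sign, since e^T W B e = (W^T e) . (B e)."""
    fa = B @ e        # feedback-alignment teaching signal
    bp = W.T @ e      # backprop teaching signal
    return float(e @ W @ fa), cosine(fa, bp)
```

For example, `alignment(W, np.linalg.pinv(W), e)` gives a first value equal to the squared norm of e whenever W has full row rank, so the pseudobackprop signal always lies within 90deg of the backprop signal.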