use https for features.
text: sort by
tags: modified
type: chronology
hide / / print
ref: -2015 tags: olshausen redwood autoencoder VAE MNIST faces variation date: 11-27-2020 03:04 gmt revision:0 [head]

Discovering hidden factors of variation in deep networks

  • Well, they are not really that deep ...
  • Use a VAE to encode both a supervised signal (class labels) as well as unsupervised latents.
  • Penalize a combination of the MSE of reconstruction, logits of the classification error, and a special cross-covariance term to decorrelate the supervised and unsupervised latent vectors.
  • Cross-covariance penalty:
  • Tested on
    • MNIST -- discovered style / rotation of the characters
    • Toronto faces database -- seven expressions, many individuals; extracted eigen-emotions sorta.
    • Multi-PIE --many faces, many viewpoints ; was able to vary camera pose and illumination with the unsupervised latents.

hide / / print
ref: -2020 tags: replay hippocampus variational autoencoder date: 10-11-2020 04:09 gmt revision:1 [0] [head]

Brain-inspired replay for continual learning with artificial neural networks

  • Gudo M van de Ven, Hava Siegelmann, Andreas Tolias
  • In the real world, samples are not replayed in shuffled order -- they occur in a sequence, typically few times. Hence, for training an ANN (or NN?), you need to 'replay' samples.
    • Perhaps, to get at hidden structure not obvious on first pass through the sequence.
    • In the brain, reactivation / replay likely to stabilize memories.
      • Strong evidence that this occurs through sharp-wave ripples (or the underlying activity associated with this).
  • Replay is also used to combat a common problem in training ANNs - catastrophic forgetting.
    • Generally you just re-sample from your database (easy), though in real-time applications, this is not possible.
      • It might also take a lot of memory (though that is cheap these days) or violate privacy (though again who cares about that)

  • They study two different classification problems:
    • Task incremental learning (Task-IL)
      • Agent has to serially learn distinct tasks
      • OK for Atari, doesn't make sense for classification
    • Class incremental learning (Class-IL)
      • Agent has to learn one task incrementally, one/few classes at a time.
      • Like learning a 2 digits at a time in MNIST
        • But is tested on all digits shown so far.
  • Solved via Generative Replay (GR, ~2017)
  • Use a recursive formulation: 'old' generative model is used to generate samples, which are then classified and fed, interleaved with the new samples, to the new network being trained.
    • 'Old' samples can be infrequent -- it's easier to reinforce an existing memory rather than create a new one.
    • Generative model is a VAE.
  • Compared with some existing solutions to catastrophic forgetting:
    • Methods to protect parameters in the network important for previous tasks
      • Elastic weight consolidation (EWC)
      • Synaptic intelligence (SI)
        • Both methods maintain estimates of how influential parameters were for previous tasks, and penalize changes accordingly.
        • "metaplasticity"
        • Synaptic intelligence: measure the loss change relative to the individual weights.
        • δL=δLδθδθδtδt \delta L = \int \frac{\delta L}{\delta \theta} \frac{\delta \theta}{\delta t} \delta t ; converted into discrete time / SGD: L=Σ kω k=ΣδLδθδθδtδt L = \Sigma_k \omega_k = \Sigma \int \frac{\delta L}{\delta \theta} \frac{\delta \theta}{\delta t} \delta t
        • ω k\omega_k are then the weightings for how much parameter change contributed to the training improvement.
        • Use this as a per-parameter regularization strength, scaled by one over the square of 'how far it moved'.
        • This is added to the loss, so that the network is penalized for moving important weights.
    • Context-dependent gating (XdG)
      • To reduce interference between tasks, a random subset of neurons is gated off (inhibition), depending on the task.
    • Learning without forgetting (LwF)
      • Method replays current task input after labeling them (incorrectly?) using the model trained on the previous tasks.
  • Generative replay works on Class-IL!
  • And is robust -- not to many samples or hidden units needed (for MNIST)

  • Yet the generative replay system does not scale to CIFAR or permuted MNIST.
  • E.g. if you take the MNIST pixels, permute them based on a 'task', and ask a network to still learn the character identities , it can't do it ... though synaptic intelligence can.
  • Their solution is to make 'brain-inspired' modifications to the network:
    • RtF, Replay-though-feedback: the classifier and generator network are fused. Latent vector is the hippocampus. Cortex is the VAE / classifier.
    • Con, Conditional replay: normal prior for the VAE is replaced with multivariate class-conditional Gaussian.
      • Not sure how they sample from this, check the methods.
    • Gat, Gating based on internal context.
      • Gating is only applied to the feedback layers, since for classification ... you don't a priori know the class!
    • Int, Internal replay. This is maybe the most interesting: rather than generating pixels, feedback generates hidden layer activations.
      • First layer of a network is convolutional, dependent on visual feature statistics, and should not change much.
        • Indeed, for CIFAR, they use pre-trained layers.
      • Internal replay proved to be very important!
    • Dist, Soft target labeling of the generated targets; cross-entropy loss when training the classifier on generated samples. Aka distillation.
  • Results suggest that regularization / metaplasticity (keeping memories in parameter space) and replay (keeping memories in function space) are complementary strategies,
    • And that the brain uses both to create and protect memories.

  • When I first read this paper, it came across as a great story -- well thought out, well explained, a good level of detail, and sufficiently supported by data / lesioning experiments.
  • However, looking at the first authors pub record, it seems that he's been at this for >2-3 years ... things take time to do & publish.
  • Folding in of the VAE is satisfying -- taking one function approximator and use it to provide memory for another function approximator.
  • Also satisfying are the neurological inspirations -- and that full feedback to the pixel level was not required!
    • Maybe the hippocampus does work like this, providing high-level feature vectors to the cortex.
    • And it's likely that the cortex has some features of a VAE, e.g. able to perceive and imagine through the same nodes, just run in different directions.
      • The fact that both concepts led to an engineering solution is icing on the cake!

hide / / print
ref: -2017 tags: google deepmind compositional variational autoencoder date: 04-08-2020 01:16 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

SCAN: learning hierarchical compositional concepts

  • From DeepMind, first version Jul 2017 / v3 June 2018.
  • Starts broad and strong:
    • "The seemingly infinite diversity of the natural world from a relatively small set of coherent rules"
      • Relative to what? What's the order of magnitude here? In personal experience, each domain involves a large pile of relevant details..
    • "We conjecture that these rules dive rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts"
    • "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts."
    • "Compositionality is at the core of such human abilities as creativity, imagination, and language-based communication.
    • This addresses the limitations of deep learning, which are overly data hungry (low sample efficiency), tend to overfit the data, and require human supervision.
  • Approach:
    • Factorize the visual world with a Β\Beta -VAE to learn a set of representational primitives through unsupervised exposure to visual data.
    • Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set if visual primitives (features from beta-VAE) that the examples have in common.
      • E.g. this is purely associative learning, with a finite one-layer association matrix.
    • Test on both image 2 symbols and symbols to image directions. For the latter, allow irrelevant attributes to be filled in from the priors (this is important later in the paper..)
    • Add in a third module, which allows learning of compositions of the features, ala set notation: AND ( \cup ), IN-COMMON ( \cap ) & IGNORE ( \setminus or '-'). This is via a low-parameter convolutional model.
  • Notation:
    • q ϕ(z x|x)q_{\phi}(z_x|x) is the encoder model. ϕ\phi are the encoder parameters, xx is the visual input, z xz_x are the latent parameters inferred from the scene.
    • p theta(x|z x)p_{theta}(x|z_x) is the decoder model. xp θ(x|z x)x \propto p_{\theta}(x|z_x) , θ\theta are the decoder parameters. xx is now the reconstructed scene.
  • From this, the loss function of the beta-VAE is:
    • 𝕃(θ,ϕ;x,z x,β)=𝔼 q ϕ(z x|x)[logp θ(x|z x)]βD KL(q ϕ(z x|x)||p(z x)) \mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} [log p_{\theta}(x|z_x)] - \beta D_{KL} (q_{\phi}(z_x|x)|| p(z_x)) where Β>1\Beta \gt 1
      • That is, maximize the auto-encoder fit (the expectation of the decoder, over the encoder output -- aka the pixel log-likelihood) minus the KL divergence between the encoder distribution and p(z x)p(z_x)
        • p(z)𝒩(0,I)p(z) \propto \mathcal{N}(0, I) -- diagonal normal matrix.
        • β\beta comes from the Lagrangian solution to the constrained optimization problem:
        • max ϕ,θ𝔼 xD[𝔼 q ϕ(z|x)[logp θ(x|z)]]\max_{\phi,\theta} \mathbb{E}_{x \sim D} [\mathbb{E}_{q_{\phi}(z|x)}[log p_{\theta}(x|z)]] subject to D KL(q ϕ(z|x)||p(z))<εD_{KL}(q_{\phi}(z|x)||p(z)) \lt \epsilon where D is the domain of images etc.
      • Claim that this loss function tips the scale too far away from accurate reconstruction with sufficient visual de-tangling (that is: if significant features correspond to small details in pixel space, they are likely to be ignored); instead they adopt the approach of the denoising auto-encoder ref, which uses the feature L2 norm instead of the pixel log-likelihood:
    • 𝕃(θ,ϕ;X,z x,β)=𝔼 q ϕ(z x|x)||J(x^)J(x)|| 2 2βD KL(q ϕ(z x|x)||p(z x)) \mathbb{L}(\theta, \phi; X, z_x, \beta) = -\mathbb{E}_{q_{\phi}(z_x|x)}||J(\hat{x}) - J(x)||_2^2 - \beta D_{KL} (q_{\phi}(z_x|x)|| p(z_x)) where J: WxHxC NJ : \mathbb{R}^{W x H x C} \rightarrow \mathbb{R}^N maps from images to high-level features.
      • This J(x)J(x) is from another neural network (transfer learning) which learns features beforehand.
      • It's a multilayer perceptron denoising autoencoder [Vincent 2010].
  • The SCAN architecture includes an additional element, another VAE which is trained simultaneously on the labeled inputs yy and the latent outputs from encoder z xz_x given xx .
  • In this way, they can present a description yy to the network, which is then recomposed into z yz_y , that then produces an image x^\hat{x} .
    • The whole network is trained by minimizing:
    • 𝕃 y(θ y,ϕ y;y,x,z y,β,λ)=1 st2 nd3 rd \mathbb{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = 1^{st} - 2^{nd} - 3^{rd}
      • 1st term: 𝔼 q ϕ y(z y|y)[logp θ y(y|z y)] \mathbb{E}_{q_{\phi_y}(z_y|y)}[log p_{\theta_y} (y|z_y)] log-likelihood of the decoded symbols given encoded latents z yz_y
      • 2nd term: βD KL(q ϕ y(z y|y)||p(z y)) \beta D_{KL}(q_{\phi_y}(z_y|y) || p(z_y)) weighted KL divergence between encoded latents and diagonal normal prior.
      • 3rd term: λD KL(q ϕ x(z x|y)||q ϕ y(z y|y))\lambda D_{KL}(q_{\phi_x}(z_x|y) || q_{\phi_y}(z_y|y)) weighted KL divergence between latents from the images and latents from the description yy .
        • They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right.
  • Final element! A convolutional recombination element, implemented as a tensor product between z y1z_{y1} and z y2z_{y2} that outputs a one-hot encoding of set-operation that's fed to a (hardcoded?) transformation matrix.
    • I don't think this is great shakes. Could have done this with a small function; no need for a neural network.
    • Trained with very similar loss function as SCAN or the beta-VAE.

  • Testing:
  • They seem to have used a very limited subset of "DeepMind Lab" -- all of the concept or class labels could have been implimented easily, e.g. single pixel detector for the wall color. Quite disappointing.
  • This is marginally more interesting -- the network learns to eliminate latent factors as it's exposed to examples (just like perhaps a Bayesian network.)
  • Similarly, the CelebA tests are meh ... not a clear improvement over the existing VAEs.

hide / / print
ref: -0 tags: variational free energy inference learning bayes curiosity insight Karl Friston date: 02-15-2019 02:09 gmt revision:1 [0] [head]

PMID-28777724 Active inference, curiosity and insight. Karl J. Friston, Marco Lin, Christopher D. Frith, Giovanni Pezzulo,

  • This has been my intuition for a while; you can learn abstract rules via active probing of the environment. This paper supports such intuitions with extensive scholarship.
  • “The basic theme of this article is that one can cast learning, inference, and decision making as processes that resolve uncertanty about the world.
    • References Schmidhuber 1991
  • “A learner should choose a policy that also maximizes the learner’s predictive power. This makes the world both interesting and exploitable.” (Still and Precup 2012)
  • “Our approach rests on the free energy principle, which asserts that any sentient creature must minimize the entropy of its sensory exchanges with the world.” Ok, that might be generalizing things too far..
  • Levels of uncertainty:
    • Perceptual inference, the causes of sensory outcomes under a particular policy
    • Uncertainty about policies or about future states of the world, outcomes, and the probabilistic contingencies that bind them.
  • For the last element (probabilistic contingencies between the world and outcomes), they employ Bayesian model selection / Bayesian model reduction
    • Can occur not only on the data, but exclusively on the initial model itself.
    • “We use simulations of abstract rule learning to show that context-sensitive contingiencies, which are manifest in a high-dimensional space of latent or hidden states, can be learned with straightforward variational principles (ie. minimization of free energy).
  • Assume that initial states and state transitions are known.
  • Perception or inference about hidden states (i.e. state estimation) corresponds to inverting a generative model gievn a sequence of outcomes, while learning involves updating the parameters of the model.
  • The actual task is quite simple: central fixation leads to a color cue. The cue + peripheral color determines either which way to saccade.
  • Gestalt: Good intuitions, but I’m left with the impression that the authors overexplain and / or make the description more complicated that it need be.
    • The actual number of parameters to to be inferred is rather small -- 3 states in 4 (?) dimensions, and these parameters are not hard to learn by minimizing the variational free energy:
    • F=D[Q(x)||P(x)]E q[ln(P(o t|x)]F = D[Q(x)||P(x)] - E_q[ln(P(o_t|x)] where D is the Kullback-Leibler divergence.
      • Mean field approximation: Q(x)Q(x) is fully factored (not here). many more notes

hide / / print
ref: -0 tags: spike sorting variational bayes PCA Japan date: 04-04-2012 20:16 gmt revision:1 [0] [head]

PMID-22448159 Spike sorting of heterogeneous neuron types by multimodality-weighted PCA and explicit robust variational Bayes.

  • Cutting edge windowing-then-sorting method.
  • projection multimodality-weighted principal component analysis (mPCA, novel).
    • Multimodality of a feature is by checking the informativeness using the KS test of a given feature.
  • Also investigate graph laplacian features (GLF), which projects high-dimensional data onto a low-dimensional space while preserving topological structure.
  • Clustering based on variational Bayes for Student's T mixture model (SVB).
    • Does not rely on MAP inference and works reliably over difficult-to sort data, e.g. bursting neurons and sparsely firing neurons.
  • Wavelet preprocessing improves spike separation.
  • open-source, available at http://etos.sourceforge.net/