{1524} revision 1 modified: 10-11-2020 04:09 gmt

Brain-inspired replay for continual learning with artificial neural networks

  • Gido M van de Ven, Hava Siegelmann, Andreas Tolias
  • In the real world, samples are not presented in shuffled order -- they occur in a sequence, typically only a few times. Hence, to train an ANN on such a stream, you need to 'replay' samples.
    • Perhaps to get at hidden structure that is not obvious on a first pass through the sequence.
    • In the brain, reactivation / replay likely serves to stabilize memories.
      • There is strong evidence that this occurs through sharp-wave ripples (or the underlying activity associated with them).
  • Replay is also used to combat a common problem in training ANNs - catastrophic forgetting.
    • Generally you just re-sample from your database (easy), though in real-time applications this is not possible.
      • It might also take a lot of memory (though that is cheap these days) or violate privacy (though again, who cares about that?).

  • They study two different classification problems:
    • Task incremental learning (Task-IL)
      • Agent has to serially learn distinct tasks
      • OK for Atari, doesn't make sense for classification
    • Class incremental learning (Class-IL)
      • Agent has to learn one task incrementally, one/few classes at a time.
      • Like learning 2 digits at a time in MNIST
        • But is tested on all digits shown so far.
  • Solved via Generative Replay (GR, ~2017)
  • Use a recursive formulation: the 'old' generative model generates samples, which are labeled by the 'old' classifier and fed, interleaved with the new samples, to the new network being trained (a minimal sketch follows below).
    • 'Old' samples can be infrequent -- it's easier to reinforce an existing memory than to create a new one.
    • The generative model is a VAE.
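    • A minimal sketch of this loop in PyTorch (my own illustration, not the paper's code; `old_vae.decode`, `old_vae.latent_dim`, and the 50/50 mixing of the two losses are assumptions):

        import torch
        import torch.nn.functional as F

        def generative_replay_step(new_classifier, old_classifier, old_vae,
                                   x_new, y_new, optimizer):
            """One training step: the frozen 'old' VAE generates samples, the frozen
            'old' classifier labels them, and they are interleaved with new data."""
            with torch.no_grad():
                z = torch.randn(x_new.size(0), old_vae.latent_dim)   # sample the prior
                x_replay = old_vae.decode(z)                          # replayed 'old' samples
                y_replay = old_classifier(x_replay).argmax(dim=1)     # labeled by the 'old' model
            optimizer.zero_grad()
            loss_new = F.cross_entropy(new_classifier(x_new), y_new)
            loss_replay = F.cross_entropy(new_classifier(x_replay), y_replay)
            loss = 0.5 * (loss_new + loss_replay)   # interleave old and new
            loss.backward()
            optimizer.step()
            return loss.item()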
  • Compared with some existing solutions to catastrophic forgetting:
    • Methods to protect parameters in the network important for previous tasks
      • Elastic weight consolidation (EWC)
      • Synaptic intelligence (SI)
        • Both methods maintain estimates of how influential parameters were for previous tasks, and penalize changes accordingly.
        • "metaplasticity"
        • Synaptic intelligence: attribute the change in loss to the individual weights, based on each weight's gradient and how far it moved during training.
        • $\delta L = \int \frac{\delta L}{\delta \theta} \frac{\delta \theta}{\delta t} \, \delta t$; converted into discrete time / SGD: $\delta L = \sum_k \omega_k = \sum_k \int \frac{\delta L}{\delta \theta_k} \frac{\delta \theta_k}{\delta t} \, \delta t$
        • $\omega_k$ are then the weightings for how much each parameter's change contributed to the training improvement.
        • Use this as a per-parameter regularization strength, scaled by one over the square of 'how far it moved'.
        • This is added to the loss, so that the network is penalized for moving important weights (see the sketch below).
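        • A rough Python sketch of this bookkeeping (my own, not the paper's; the damping term xi, which avoids division by zero, is from the original SI paper):

            import torch

            class SynapticIntelligence:
                def __init__(self, params, xi=1e-3):
                    self.params = list(params)
                    self.xi = xi
                    self.omega = [torch.zeros_like(p) for p in self.params]     # running path integral
                    self.strength = [torch.zeros_like(p) for p in self.params]  # consolidated importance
                    self.anchor = [p.detach().clone() for p in self.params]     # theta at last task boundary

                def accumulate(self, prev_values):
                    # discrete form of the integral above: omega_k += -grad_k * delta_theta_k
                    for om, p, prev in zip(self.omega, self.params, prev_values):
                        if p.grad is not None:
                            om += -p.grad.detach() * (p.detach() - prev)

                def consolidate(self):
                    # at a task boundary: importance = omega / ('how far it moved')^2
                    for s, om, p, a in zip(self.strength, self.omega, self.params, self.anchor):
                        s += om / ((p.detach() - a) ** 2 + self.xi)
                        om.zero_()
                    self.anchor = [p.detach().clone() for p in self.params]

                def penalty(self, c=1.0):
                    # added to the task loss: moving important weights is penalized
                    return c * sum((s * (p - a) ** 2).sum()
                                   for s, p, a in zip(self.strength, self.params, self.anchor))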
    • Context-dependent gating (XdG)
      • To reduce interference between tasks, a random (but fixed per task) subset of neurons is gated off (inhibition), depending on which task is active (sketch below).
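      • A tiny illustration of the idea (mine; the ~80% gating fraction is an assumption based on my reading of the XdG paper):

          import torch

          def make_task_gates(n_tasks, n_hidden, frac_gated=0.8, seed=0):
              """One fixed random binary mask per task; ~frac_gated of the hidden units are silenced."""
              g = torch.Generator().manual_seed(seed)
              return [(torch.rand(n_hidden, generator=g) > frac_gated).float()
                      for _ in range(n_tasks)]

          # during training and testing on task t:
          #   h = torch.relu(layer(x)) * gates[t]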
    • Learning without forgetting (LwF)
      • Replays the current task's inputs after labeling them with the model trained on the previous tasks -- the 'labels' are thus the old model's (possibly incorrect) predictions, used as soft targets to preserve old behavior.
  • Generative replay works on Class-IL!
  • And is robust -- not too many samples or hidden units are needed (for MNIST).

  • Yet the generative replay system does not scale to CIFAR or permuted MNIST.
  • E.g. if you take the MNIST pixels, permute them based on a 'task', and ask a network to still learn the digit identities, it can't do it ... though synaptic intelligence can (see the snippet below).
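    • For concreteness, a permuted-MNIST 'task' just applies a fixed, task-specific pixel permutation to every image (my own snippet, assuming flattened 28x28 images):

        import numpy as np

        def make_permuted_task(images, seed):
            """images: (N, 784) array; each task gets its own fixed permutation."""
            perm = np.random.default_rng(seed).permutation(images.shape[1])
            return images[:, perm]   # same labels, scrambled pixel order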
  • Their solution is to make 'brain-inspired' modifications to the network:
    • RtF, Replay-through-feedback: the classifier and generator network are fused. The latent vector is the hippocampus; the cortex is the VAE / classifier.
    • Con, Conditional replay: the normal prior of the VAE is replaced with class-conditional multivariate Gaussians (one mode per class).
      • Not sure how they sample from this -- check the methods. (One plausible way -- pick a class, then sample from that class's Gaussian -- is shown in the sketch after this list.)
    • Gat, Gating based on internal context.
      • Gating is only applied to the feedback (generative) layers, since for classification you don't a priori know the class / context!
    • Int, Internal replay. This is maybe the most interesting: rather than generating pixels, feedback generates hidden layer activations.
      • The first layers of a network are convolutional, depend on generic visual feature statistics, and should not change much across tasks.
        • Indeed, for CIFAR, they use pre-trained layers.
      • Internal replay proved to be very important!
    • Dist, Soft target labeling of the generated samples; a soft-target cross-entropy loss is used when training the classifier on them. Aka distillation. (A combined sketch of the Con / Int / Dist ideas follows below.)
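    • A combined sketch of conditional sampling, internal replay, and distillation (my own Python; `decode`, `classify_from_features`, and `frozen_embed` are placeholder names, and for simplicity the generator and classifier are kept separate rather than fused as in RtF):

        import torch
        import torch.nn.functional as F

        def brain_inspired_replay_step(model, old_model, x_new, y_new, optimizer,
                                       class_means, class_logvars, n_replay=64, T=2.0):
            with torch.no_grad():
                # Conditional replay: pick classes, then sample from each class's Gaussian prior.
                cls = torch.randint(0, class_means.size(0), (n_replay,))
                z = class_means[cls] + torch.randn(n_replay, class_means.size(1)) \
                    * (0.5 * class_logvars[cls]).exp()
                # Internal replay: decode to hidden-layer activations, not pixels.
                h_replay = old_model.decode(z)
                # Distillation: soft targets from the old model's predictions.
                soft_targets = F.softmax(old_model.classify_from_features(h_replay) / T, dim=1)
            optimizer.zero_grad()
            # Current data passes through the frozen, pre-trained first layers.
            logits_new = model.classify_from_features(model.frozen_embed(x_new))
            loss_new = F.cross_entropy(logits_new, y_new)
            # Replayed features are trained against the soft targets (distillation loss).
            log_probs = F.log_softmax(model.classify_from_features(h_replay) / T, dim=1)
            loss_replay = F.kl_div(log_probs, soft_targets, reduction='batchmean') * T * T
            loss = 0.5 * (loss_new + loss_replay)
            loss.backward()
            optimizer.step()
            return loss.item()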
  • Results suggest that regularization / metaplasticity (keeping memories in parameter space) and replay (keeping memories in function space) are complementary strategies,
    • And that the brain uses both to create and protect memories.

  • When I first read this paper, it came across as a great story -- well thought out, well explained, a good level of detail, and sufficiently supported by data / lesioning experiments.
  • However, looking at the first author's publication record, it seems that he's been at this for >2-3 years ... things take time to do & publish.
  • Folding in of the VAE is satisfying -- taking one function approximator and using it to provide memory for another function approximator.
  • Also satisfying are the neurological inspirations -- and that full feedback to the pixel level was not required!
    • Maybe the hippocampus does work like this, providing high-level feature vectors to the cortex.
    • And it's likely that the cortex has some features of a VAE, e.g. being able to perceive and imagine through the same nodes, just run in different directions.
      • The fact that both concepts led to an engineering solution is icing on the cake!