{1507} revision 0 modified: 03-28-2020 01:15 gmt

Winner-take-all Autoencoders

  • During training of fully connected layers, they enforce a winner-take-all lifetime sparsity constraint.
    • That is: when training with mini-batches, they keep the largest k percent of the activations of a given hidden unit across all samples in the mini-batch and set the remaining activations to zero. The units are not competing with each other; each unit competes with its own activations on the other samples.
    • The rest of the network is a stack of ReLU layers (upon which the sparsity constraint is applied) followed by a linear decoding layer (which makes interpretation simple).
    • They stack them via sequential training: each new layer is trained on the output of the previous one, without backpropagating the errors into the earlier layers.
  • Also works for RBMs, with lower sparsity targets.
  • Extended the result to WTA convnets -- here they enforce both spatial and temporal (mini-batch) sparsity.
    • Spatial sparsity involves selecting the single largest hidden unit activity within each feature map. The other activities and derivatives are set to zero.
    • At test time, this sparsity constraint is released; instead they apply a 4 x 4 max-pooling layer and use its output for classification or deconvolution.
  • To apply both spatial and temporal sparsity, select the highest spatial response (e.g. one unit in a 2d plane of convolutions; all have the same weights) for each feature map. Do this for every image in a mini-batch, and then apply the temporal sparsity: each feature map gets to be active exactly once, and at that time only one hidden unit (really, one input location and the shared weights, depending on stride) undergoes SGD.
    • Seems like it might train very slowly. Authors didn't note how many epochs were required.
  • This, too, can be stacked.
  • To train on larger image sets, they first extract 48 x 48 patches & again stack...
  • Tested on MNIST, SVHN, CIFAR-10 -- works OK, and works well even with few labeled examples (which is consistent with their goals).
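
A rough NumPy sketch of the two masking steps described above, to make the bookkeeping concrete. This is my own illustration and not the authors' code: the function names (lifetime_sparsity, spatial_sparsity), the (batch, units) and (batch, channels, height, width) layouts, and the rounding of k are all assumptions. It only shows the forward masking; in training the same mask would presumably also gate the gradients.

```python
import numpy as np

def lifetime_sparsity(h, k_percent):
    """Lifetime (mini-batch) sparsity for a fully connected layer.

    h: activations, shape (batch, units) -- an assumed layout.
    For each unit (column), keep its largest k percent of activations
    across the mini-batch and zero the rest.
    """
    batch = h.shape[0]
    # Number of samples each unit is allowed to stay active on (at least 1).
    k = max(1, int(round(batch * k_percent / 100.0)))
    # Per-unit threshold: the k-th largest activation in each column.
    thresh = np.partition(h, batch - k, axis=0)[batch - k]
    # Ties at the threshold would keep slightly more than k entries.
    mask = h >= thresh
    return h * mask

def spatial_sparsity(fmaps):
    """Spatial sparsity for a conv layer.

    fmaps: shape (batch, channels, height, width) -- an assumed layout.
    For every image and every feature map, keep only the single largest
    activation and zero all other locations.
    """
    b, c, hh, ww = fmaps.shape
    flat = fmaps.reshape(b, c, hh * ww)
    winners = flat.argmax(axis=2)  # winning location per (image, map)
    mask = np.zeros(flat.shape, dtype=bool)
    bi, ci = np.meshgrid(np.arange(b), np.arange(c), indexing="ij")
    mask[bi, ci, winners] = True
    return (flat * mask).reshape(b, c, hh, ww)
```

For the combined conv case, one plausible route (again an assumption) is to take the per-map winner values from spatial_sparsity, which form a (batch, channels) matrix, and pass that through lifetime_sparsity to decide which images each feature map stays active on.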