m8ta
{1454}
ref: -2011 tags: Andrew Ng high level unsupervised autoencoders date: 03-15-2019 06:09 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

Building High-level Features Using Large Scale Unsupervised Learning

  • Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng
  • Input data: 10M random 200x200 frames from YouTube. Each video contributes only one frame.
  • Used local receptive fields to reduce the communication requirements between machines. Training ran on 1000 computers, 16 cores each, for 3 days.
  • "Strongly influenced by" Olshausen & Field {1448} -- but this is limited to a shallow architecture.
  • Lee et al 2008 show that stacked RBMs can model simple functions of the cortex.
  • Lee et al 2009 show that a convolutional DBN trained on faces can learn a face detector.
  • Their architecture: sparse deep autoencoder with
    • Local receptive fields: each feature of the autoencoder can connect to only a small region of the lower layer (i.e. non-convolutional: weights are not shared across locations)
      • Purely linear layer.
      • More biologically plausible & allows the learning of more invariances other than translational invariances (Le et al 2010).
      • No weight sharing means the network is very large: about 1 billion weights.
        • Still, the human visual cortex is about a million times larger in neurons and synapses.
    • L2 pooling (Hyvarinen et al 2009) which allows the learning of invariant features.
      • I.e. each pooling unit outputs the square root of the sum of the squares of its inputs (a square-root nonlinearity).
    • Local contrast normalization -- subtractive and divisive (Jarrett et al 2009)
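The L2 pooling step described above can be illustrated with a minimal sketch. This is not the paper's implementation: it pools over a 1-D list rather than 2-D feature maps, and the names `l2_pool`, `l2_pool_1d`, the window size and the stride are all assumptions chosen for illustration. It only shows why the square root of a sum of squares is insensitive to how energy is distributed (or signed) within a pooling window.

```python
import math

def l2_pool(values, eps=0.0):
    """Square root of (eps + sum of squares) of the pooled inputs."""
    return math.sqrt(eps + sum(v * v for v in values))

def l2_pool_1d(features, size, stride):
    """Apply L2 pooling over sliding windows of a 1-D list of responses.

    `size` and `stride` are illustrative parameters, not values from the paper.
    """
    return [l2_pool(features[i:i + size])
            for i in range(0, len(features) - size + 1, stride)]

# Two response patterns with the same energy in the first window,
# but distributed differently and with one sign flipped.
responses = [3.0, 4.0, 0.0, 0.0]
shifted = [4.0, -3.0, 0.0, 0.0]

print(l2_pool_1d(responses, size=2, stride=2))  # [5.0, 0.0]
print(l2_pool_1d(shifted, size=2, stride=2))    # [5.0, 0.0]
```

Both inputs give the same pooled output, which is the sense in which the pooling unit is invariant to changes among the features it pools over.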
  • Encoding weights $W_1$ and decoding weights $W_2$ are adjusted to minimize the reconstruction error, penalized by 0.1 * the sparse pooling layer activation. The latter term encourages the network to find invariances.
  • $\underset{W_1, W_2}{\text{minimize}} \sum_{i=1}^m \left( \| W_2 W_1^T x^{(i)} - x^{(i)} \|_2^2 + \lambda \sum_{j=1}^k \sqrt{\epsilon + H_j (W_1^T x^{(i)})^2} \right)$
    • $H_j$ is the vector of weights of the j-th pooling element; $\lambda = 0.1$; m examples; k pooling units.
    • This is also known as reconstruction Topographic Independent Component Analysis.
    • Weights are updated through asynchronous SGD.
    • Minibatch size 100.
    • Note deeper autoencoders don't fare consistently better.
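As a rough sketch of the objective above (reconstruction error plus $\lambda$ times the pooled activations), the following evaluates it for a toy single-layer case in plain Python. It makes several assumptions not stated in these notes: the square inside the pooling term is taken elementwise on $W_1^T x$, the pooling weights `H` are treated as fixed, and the helper names (`matvec`, `transpose`, `rica_objective`) and the toy matrices are hypothetical. It computes the loss only; the asynchronous SGD update is not shown.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def rica_objective(W1, W2, H, X, lam=0.1, eps=1e-8):
    """Sum over examples of reconstruction error + lam * pooled activation.

    W1, W2: d x n matrices (encoding / decoding weights).
    H:      k x n matrix, row j holds the weights of pooling unit j (assumed fixed).
    X:      list of m input vectors of length d.
    """
    W1_T = transpose(W1)
    total = 0.0
    for x in X:
        z = matvec(W1_T, x)                  # linear encoding W_1^T x
        x_hat = matvec(W2, z)                # linear decoding W_2 W_1^T x
        recon = sum((a - b) ** 2 for a, b in zip(x_hat, x))
        z_sq = [v * v for v in z]            # elementwise square (assumption)
        pooled = sum(math.sqrt(eps + p) for p in matvec(H, z_sq))
        total += recon + lam * pooled
    return total

# Toy case: identity encoder/decoder, one pooling unit over both features.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 1.0]]
X = [[3.0, 4.0]]

print(rica_objective(W1, W2, H, X, lam=0.1, eps=0.0))  # 0.5
```

With identity weights the reconstruction is perfect, so the whole loss comes from the penalty term: 0.1 * sqrt(3^2 + 4^2) = 0.5.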