SCAN: learning hierarchical compositional concepts
 From DeepMind, first version Jul 2017 / v3 June 2018.
 Starts broad and strong:
 "The seemingly infinite diversity of the natural world from a relatively small set of coherent rules"
 Relative to what? What's the order of magnitude here? In personal experience, each domain involves a large pile of relevant details..
 "We conjecture that these rules dive rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts"
 "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts."
 "Compositionality is at the core of such human abilities as creativity, imagination, and languagebased communication.
 This addresses the limitations of deep learning, which is overly data-hungry (low sample efficiency), tends to overfit the data, and requires human supervision.
 Approach:
 Factorize the visual world with a $\beta$-VAE to learn a set of representational primitives through unsupervised exposure to visual data.
 Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set of visual primitives ($\beta$-VAE features) that the examples have in common.
 I.e. this is purely associative learning, with a finite one-layer association matrix.
 Test in both the image-to-symbol and symbol-to-image directions. For the latter, irrelevant attributes are allowed to be filled in from the priors (this is important later in the paper).
 Add in a third module, which allows learning of compositions of the features, à la set notation: AND ($\cup$), IN COMMON ($\cap$), and IGNORE ($\setminus$). This is via a low-parameter convolutional model.
 Notation:
 $q_{\phi}(z_x|x)$ is the encoder model. $\phi$ are the encoder parameters, $x$ is the visual input, $z_x$ are the latent parameters inferred from the scene.
 $p_{\theta}(x|z_x)$ is the decoder model: $\hat{x} \sim p_{\theta}(x|z_x)$, where $\theta$ are the decoder parameters and $\hat{x}$ is the reconstructed scene.
 From this, the loss function of the $\beta$-VAE is:
 $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} [\log p_{\theta}(x|z_x)] - \beta D_{KL} (q_{\phi}(z_x|x) \| p(z_x))$ where $\beta > 1$
 That is, maximize the autoencoder fit (the expectation of the decoder log-likelihood over the encoder output, aka the pixel log-likelihood) minus the KL divergence between the encoder distribution and $p(z_x)$.
 $p(z_x) = \mathcal{N}(0, I)$, a unit-variance diagonal (isotropic) Gaussian prior.
 $\beta$ comes from the Lagrangian solution to the constrained optimization problem:
 $\max_{\phi,\theta} \mathbb{E}_{x \sim D} \left[ \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] \right]$ subject to $D_{KL}(q_{\phi}(z|x) \| p(z)) < \epsilon$, where $D$ is the dataset of images etc.
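 A minimal numpy sketch of this objective for a single image, assuming a Bernoulli pixel likelihood and a diagonal Gaussian encoder (standard $\beta$-VAE choices, not spelled out in these notes); the function name and arguments are mine, and the expectation over the encoder is approximated with a single decoder sample.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """Negative beta-VAE objective for one image (to be minimized).

    x, x_hat    : flattened pixel arrays in [0, 1] (input and reconstruction)
    mu, log_var : parameters of the diagonal Gaussian encoder q_phi(z_x|x)
    beta        : KL weight; beta > 1 applies the disentangling pressure
    """
    eps = 1e-7
    # Reconstruction term E_q[log p_theta(x|z_x)]: Bernoulli pixel log-likelihood,
    # approximated with one sample z_x ~ q and its decoding x_hat.
    log_lik = np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    # Analytic KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    # The paper maximizes log_lik - beta * kl; return the negative to minimize.
    return -(log_lik - beta * kl)
```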
 They claim that this loss function tips the scale too far away from accurate reconstruction once the disentangling pressure is high enough (that is, if significant features correspond to small details in pixel space, they are likely to be ignored). Instead they adopt the approach of the denoising autoencoder, which uses an L2 norm in feature space rather than the pixel log-likelihood:
 $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} \|J(\hat{x}) - J(x)\|_2^2 - \beta D_{KL} (q_{\phi}(z_x|x) \| p(z_x))$ where $J : \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^N$ maps from images to high-level features.
 This $J(x)$ is from another neural network (transfer learning) which learns features beforehand.
 It's a multilayer perceptron denoising autoencoder [Vincent 2010].
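 As a sketch of the modified objective: the only change is swapping the pixel term for an L2 distance in the feature space of the pre-trained DAE. `J` stands for that frozen feature map; the names and the single-sample approximation are mine.

```python
import numpy as np

def scan_image_loss(x, x_hat, mu, log_var, J, beta=4.0):
    """beta-VAE loss with the pixel likelihood replaced by a feature-space L2 term.

    J : callable mapping an image of shape (W, H, C) to an N-dim feature vector,
        e.g. the hidden layer of a pre-trained denoising autoencoder (frozen).
    """
    # Feature-space reconstruction error ||J(x_hat) - J(x)||_2^2 (single decoder sample).
    recon = np.sum((J(x_hat) - J(x)) ** 2)
    # Same KL term to the unit Gaussian prior as in the plain beta-VAE.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    # Minimize: feature reconstruction error plus the weighted KL.
    return recon + beta * kl
```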
 The SCAN architecture includes an additional element: another VAE, trained simultaneously on the labeled inputs $y$ and on the latents $z_x$ inferred by the image encoder from $x$.

 In this way, they can present a description $y$ to the network, which is encoded into $z_y$ and then decoded to produce an image $\hat{x}$.
 The whole network is trained by maximizing:
 $\mathbb{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = (1^{st}\ \text{term}) - (2^{nd}\ \text{term}) - (3^{rd}\ \text{term})$
 1st term: $\mathbb{E}_{q_{\phi_y}(z_y|y)}[\log p_{\theta_y}(y|z_y)]$, the log-likelihood of the decoded symbols given the encoded latents $z_y$.
 2nd term: $\beta D_{KL}(q_{\phi_y}(z_y|y) \| p(z_y))$, the weighted KL divergence between the encoded latents and the diagonal normal prior.
 3rd term: $\lambda D_{KL}(q_{\phi_x}(z_x|x) \| q_{\phi_y}(z_y|y))$, the weighted KL divergence between the latents from the image and the latents from the description $y$.
 They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right.
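 A numpy sketch of the three terms, assuming both encoders output diagonal Gaussians and the symbols $y$ are k-hot vectors with a Bernoulli decoder; the names, the $\beta$/$\lambda$ values, and the single-sample approximation of the expectation are mine, but the KL direction (image latents first) follows the note above.

```python
import numpy as np

def kl_diag_gauss(mu1, log_var1, mu2, log_var2):
    """KL( N(mu1, diag(exp(log_var1))) || N(mu2, diag(exp(log_var2))) )."""
    return 0.5 * np.sum(
        log_var2 - log_var1
        + (np.exp(log_var1) + (mu1 - mu2) ** 2) / np.exp(log_var2)
        - 1.0
    )

def scan_symbol_loss(y, y_hat, mu_y, log_var_y, mu_x, log_var_x,
                     beta=1.0, lam=1.0):
    """Negative of the three-term objective for the symbol VAE (to be minimized).

    y, y_hat        : k-hot symbol vector and its reconstructed probabilities
    mu_y, log_var_y : q_phi_y(z_y|y), the symbol encoder
    mu_x, log_var_x : q_phi_x(z_x|x) for the paired image (treated as fixed here)
    """
    eps = 1e-7
    # 1st term: symbol log-likelihood E_q[log p_theta_y(y|z_y)].
    log_lik = np.sum(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    # 2nd term: KL(q(z_y|y) || N(0, I)).
    kl_prior = kl_diag_gauss(mu_y, log_var_y,
                             np.zeros_like(mu_y), np.zeros_like(log_var_y))
    # 3rd term: KL(q(z_x|x) || q(z_y|y)); the direction matters, image latents first.
    kl_align = kl_diag_gauss(mu_x, log_var_x, mu_y, log_var_y)
    return -(log_lik - beta * kl_prior - lam * kl_align)
```

 If I read the direction right, putting the image latents first forces the symbol posterior to cover the image posteriors of all images paired with that symbol, so attributes the symbol doesn't constrain stay broad (near the prior) and get filled in by sampling at generation time.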
 Final element! A convolutional recombination module, implemented as a tensor product between $z_{y_1}$ and $z_{y_2}$, that outputs a one-hot encoding of the set operation, which is fed to a (hardcoded?) transformation matrix.
 I don't think this is any great shakes. Could have done this with a small function; no need for a neural network (a sketch of what I mean follows below).
 Trained with a loss function very similar to that of SCAN or the $\beta$-VAE.
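 To make the complaint concrete, here is the kind of small function I have in mind. This is my own guess at the semantics (treat a latent dimension as "specified" when its posterior departs from the unit prior, then apply the set operation dimension-wise), not the paper's learned convolutional operator.

```python
import numpy as np

PRIOR_MU, PRIOR_LOG_VAR = 0.0, 0.0  # unit Gaussian prior, per latent dimension

def specified(mu, log_var, thresh=0.1):
    """A latent dimension counts as 'specified' if its KL to the unit prior is non-trivial."""
    kl = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return kl > thresh

def recombine(op, mu1, lv1, mu2, lv2):
    """Apply a set-style operator to two diagonal-Gaussian concept latents (mu, log_var)."""
    spec1, spec2 = specified(mu1, lv1), specified(mu2, lv2)
    mu, lv = mu1.copy(), lv1.copy()
    if op == "AND":           # union: also take the dims the second concept specifies
        mu[spec2], lv[spec2] = mu2[spec2], lv2[spec2]
        keep = spec1 | spec2
    elif op == "IN_COMMON":   # intersection: keep only dims specified by both
        keep = spec1 & spec2
    elif op == "IGNORE":      # difference: forget the dims the second concept specifies
        keep = spec1 & ~spec2
    else:
        raise ValueError(op)
    # Everything not kept reverts to the prior, to be filled in by sampling later.
    mu[~keep], lv[~keep] = PRIOR_MU, PRIOR_LOG_VAR
    return mu, lv
```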
 Testing:

 They seem to have used a very limited subset of "DeepMind Lab": all of the concept or class labels could have been implemented easily, e.g. with a single-pixel detector for the wall color. Quite disappointing.

 This is marginally more interesting: the network learns to eliminate latent factors as it's exposed to examples (much as a Bayesian network might).
 Similarly, the CelebA tests are meh ... not a clear improvement over the existing VAEs.

Building Highlevel Features Using Large Scale Unsupervised Learning
 Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng
 Input data: 10M random 200x200 frames from YouTube. Each video contributes only one frame.
 Used local receptive fields, to reduce the communication requirements. 1000 computers, 16 cores each, 3 days.
 "Strongly influenced by" Olshausen & Field {1448}  but this is limited to a shallow architecture.
 Lee et al 2008 show that stacked RBMs can model simple functions of the cortex.
 Lee et al 2009 show that a convolutional DBN trained on faces can learn a face detector.

 Their architecture: sparse deep autoencoder with
 Local receptive fields: each feature of the autoencoder can connect to only a small region of the lower layer (but non-convolutional).
 Purely linear layer.
 More biologically plausible, and allows learning invariances other than translational invariance (Le et al 2010).
 No weight sharing means the network is extra large: about 1 billion weights.
 Still, the human visual cortex is about a million times larger in neurons and synapses.
 L2 pooling (Hyvarinen et al 2009) which allows the learning of invariant features.
 I.e. each pooling unit outputs the square root of the sum of the squares of its inputs (a square-root nonlinearity).
 Local contrast normalization: subtractive and divisive (Jarrett et al 2009); rough sketch below.
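 A rough sketch of local contrast normalization as I understand the Jarrett et al 2009 recipe (Gaussian-weighted subtractive step, then divisive normalization by the local standard deviation); the kernel width and epsilon here are illustrative, not the paper's values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(img, sigma=2.0, eps=1e-4):
    """Subtractive then divisive local contrast normalization, per channel.

    img : float array of shape (H, W, C)
    """
    out = np.empty_like(img, dtype=float)
    for c in range(img.shape[-1]):
        x = img[..., c].astype(float)
        x = x - gaussian_filter(x, sigma)                           # subtractive step
        local_std = np.sqrt(gaussian_filter(x ** 2, sigma) + eps)
        out[..., c] = x / np.maximum(local_std, local_std.mean())   # divisive step
    return out
```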
 Encoding weights $W_1$ and decoding weights $W_2$ are adjusted to minimize the reconstruction error, penalized by 0.1 times the sparse pooling-layer activation. The latter term encourages the network to find invariances.
 $\min_{W_1, W_2} \sum_{i=1}^m \left( \left\| W_2 W_1^T x^{(i)} - x^{(i)} \right\|_2^2 + \lambda \sum_{j=1}^k \sqrt{\epsilon + H_j (W_1^T x^{(i)})^2} \right)$
 $H_j$ are the weights to the $j$-th pooling unit; $\lambda = 0.1$; $m$ examples; $k$ pooling units.
 This is also known as reconstruction Topographic Independent Component Analysis.
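 A small numpy transcription of this cost (serial, over a batch of rows), just to pin down the shapes; the variable names are mine, and the real system evaluates this distributed over ~1000 machines with asynchronous SGD rather than in a Python loop.

```python
import numpy as np

def rica_cost(W1, W2, H, X, lam=0.1, eps=1e-6):
    """Reconstruction-TICA style cost from the equation above.

    W1 : (n_in, n_feat)  encoding weights
    W2 : (n_in, n_feat)  decoding weights (map features back to the input space)
    H  : (k, n_feat)     non-negative pooling weights; row j feeds pooling unit j
    X  : (m, n_in)       m input examples as rows
    """
    cost = 0.0
    for x in X:
        h = W1.T @ x                                   # linear feature layer, W_1^T x
        recon = np.sum((W2 @ h - x) ** 2)              # reconstruction error ||W_2 W_1^T x - x||^2
        pool = np.sum(np.sqrt(eps + H @ (h ** 2)))     # L2-pooling sparsity penalty
        cost += recon + lam * pool
    return cost
```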
 Weights are updated through asynchronous SGD.
 Minibatch size 100.

 Note deeper autoencoders don't fare consistently better.
