SCAN: learning hierarchical compositional concepts
 From DeepMind, first version Jul 2017 / v3 June 2018.
 Starts broad and strong:
 "The seemingly infinite diversity of the natural world arises from a relatively small set of coherent rules"
 Relative to what? What's the order of magnitude here? In personal experience, each domain involves a large pile of relevant details.
 "We conjecture that these rules give rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts"
 "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts."
 "Compositionality is at the core of such human abilities as creativity, imagination, and language-based communication."
 This addresses the limitations of deep learning models, which are overly data-hungry (low sample efficiency), tend to overfit the data, and require human supervision.
 Approach:
 Factorize the visual world with a $\beta$-VAE to learn a set of representational primitives through unsupervised exposure to visual data.
 Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set of visual primitives (features from the beta-VAE) that the examples have in common.
 I.e. this is purely associative learning, with a finite one-layer association matrix.
 Test in both the image-to-symbol and symbol-to-image directions. For the latter, allow irrelevant attributes to be filled in from the priors (this is important later in the paper).
 Add in a third module, which allows learning of compositions of the features, à la set notation: AND ( $\cup$ ), INCOMMON ( $\cap$ ) & IGNORE ( $\setminus$ or '-'). This is via a low-parameter convolutional model.
 Notation:
 $q_{\phi}(z_x|x)$ is the encoder model. $\phi$ are the encoder parameters, $x$ is the visual input, $z_x$ are the latent parameters inferred from the scene.
 $p_{\theta}(x|z_x)$ is the decoder model. $x \sim p_{\theta}(x|z_x)$ , $\theta$ are the decoder parameters. $x$ is now the reconstructed scene.
 From this, the loss function of the beta-VAE is:
 $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} [\log p_{\theta}(x|z_x)] - \beta D_{KL} (q_{\phi}(z_x|x) \| p(z_x))$ where $\beta \gt 1$
 That is, maximize the autoencoder fit (the expectation of the decoder log-likelihood over the encoder output, aka the pixel log-likelihood) minus the $\beta$-weighted KL divergence between the encoder distribution and the prior $p(z_x)$
 $p(z) = \mathcal{N}(0, I)$ : a unit Gaussian with identity (diagonal) covariance.
 $\beta$ comes from the Lagrangian solution to the constrained optimization problem:
 $\max_{\phi,\theta} \mathbb{E}_{x \sim D} [\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]]$ subject to $D_{KL}(q_{\phi}(z|x) \| p(z)) \lt \epsilon$ where $D$ is the domain of images etc.
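 To make the objective concrete, here is a minimal sketch assuming a diagonal Gaussian encoder output (mean `mu`, log-variance `log_var`), for which the KL term against $\mathcal{N}(0, I)$ has a closed form. The function names and the value `beta=4.0` are illustrative assumptions, not taken from the paper, and the reconstruction log-likelihood is passed in as a plain number rather than computed from a decoder.

```python
import math

def kl_to_unit_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def beta_vae_objective(log_likelihood, mu, log_var, beta=4.0):
    """Quantity to maximize: reconstruction log-likelihood minus beta-weighted KL.

    beta=4.0 is an arbitrary illustrative value (the notes only say beta > 1).
    """
    return log_likelihood - beta * kl_to_unit_gaussian(mu, log_var)
```

 A posterior that equals the prior (zero mean, unit variance) pays no KL penalty; raising `beta` makes every deviation from the prior more expensive, which is the pressure towards factorized latents.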
 They claim that this loss function tips the balance too far away from accurate reconstruction once $\beta$ is large enough for sufficient visual disentangling (that is: if significant features correspond to small details in pixel space, they are likely to be ignored). Instead they adopt the approach of the denoising autoencoder (ref), which uses an L2 norm in feature space instead of the pixel log-likelihood:
 $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} \|J(\hat{x}) - J(x)\|_2^2 - \beta D_{KL} (q_{\phi}(z_x|x) \| p(z_x))$ where $J : \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^N$ maps from images to high-level features.
 Note the sign: the squared error is a cost, so to stay consistent with the maximized objective above it should enter with a minus sign.
 This $J(x)$ comes from another neural network, trained beforehand to learn the features (transfer learning).
 It's a multilayer perceptron denoising autoencoder [Vincent 2010].
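 The shape of the feature-space term can be sketched as below. The feature map `toy_features` is a hypothetical stand-in for $J$ (the real one is a pretrained denoising autoencoder); it is only here to show that the error is measured after mapping both images through the same fixed function.

```python
def toy_features(x):
    """Hypothetical stand-in for J: maps a flat image to [mean intensity, contrast]."""
    return [sum(x) / len(x), max(x) - min(x)]

def feature_l2(j, x_hat, x):
    """Squared L2 distance between J(x_hat) and J(x)."""
    return sum((a - b) ** 2 for a, b in zip(j(x_hat), j(x)))

def pixel_l2_mean(x_hat, x):
    """Mean squared error in pixel space, for comparison."""
    return sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)

# One bright pixel missing from the reconstruction of a four-pixel image.
x = [0.0, 0.0, 0.0, 1.0]
x_hat = [0.0, 0.0, 0.0, 0.0]
feature_error = feature_l2(toy_features, x_hat, x)
pixel_error = pixel_l2_mean(x_hat, x)
```

 With this toy feature map the single missing pixel costs more in feature space than in averaged pixel space, which is the kind of small-but-significant detail the feature loss is meant to protect.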
 The SCAN architecture includes an additional element: a second VAE, which is trained simultaneously on the labeled inputs $y$ and on the latents $z_x$ that the image encoder outputs given $x$ .

 In this way, they can present a description $y$ to the network, which is then recomposed into $z_y$ , that then produces an image $\hat{x}$ .
 The whole network is trained by maximizing (equivalently, minimizing the negative of):
 $\mathbb{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = 1^{st} - 2^{nd} - 3^{rd}$
 1st term: $\mathbb{E}_{q_{\phi_y}(z_y|y)}[\log p_{\theta_y} (y|z_y)]$ log-likelihood of the decoded symbols given encoded latents $z_y$
 2nd term: $\beta D_{KL}(q_{\phi_y}(z_y|y) \| p(z_y))$ weighted KL divergence between encoded latents and diagonal normal prior.
 3rd term: $\lambda D_{KL}(q_{\phi_x}(z_x|x) \| q_{\phi_y}(z_y|y))$ weighted KL divergence between latents from the images and latents from the description $y$ .
 They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right.
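 Why the direction matters can be seen numerically. The sketch below assumes both posteriors are diagonal Gaussians and uses the closed-form KL between them; the numbers are made up for illustration and do not come from the paper.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ), summed over dimensions."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total

# One latent dimension that the symbol does not care about:
# the image posterior is sharp (it saw a specific image),
# the symbol posterior stays at the unit prior.
image_mu, image_var = [1.0], [0.01]
symbol_mu, symbol_var = [0.0], [1.0]

forward = kl_diag_gaussians(image_mu, image_var, symbol_mu, symbol_var)  # KL(z_x || z_y)
reverse = kl_diag_gaussians(symbol_mu, symbol_var, image_mu, image_var)  # KL(z_y || z_x)
```

 Here `forward` is about 2.3 while `reverse` is about 97.2: the image-to-symbol direction tolerates a symbol posterior that stays broad on an attribute the symbol leaves unspecified, whereas the reverse direction punishes it heavily. That is at least consistent with the earlier remark about filling irrelevant attributes in from the prior.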
 Final element! A convolutional recombination element, implemented as a tensor product between $z_{y1}$ and $z_{y2}$ that outputs a one-hot encoding of the set operation, which is fed to a (hardcoded?) transformation matrix.
 I don't think this is great shakes. Could have done this with a small function; no need for a neural network.
 Trained with a very similar loss function to that of SCAN or the beta-VAE.
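 To illustrate what such a "small function" might look like: the sketch below is purely hypothetical and is not the paper's learned operator. It assumes each concept is a list of per-dimension `(mean, variance)` pairs, that an unspecified dimension sits at the unit prior, and that a dimension counts as specified when its variance falls below an arbitrary threshold.

```python
PRIOR = (0.0, 1.0)  # (mean, variance) of an unspecified latent dimension

def is_specified(dim, var_threshold=0.5):
    """A dimension is 'specified' if its variance is well below the prior's (threshold is arbitrary)."""
    return dim[1] < var_threshold

def recombine(op, z1, z2):
    """Hand-written set operations over two concepts given as lists of (mean, variance)."""
    out = []
    for a, b in zip(z1, z2):
        if op == "AND":
            # Keep whichever concept pins this dimension down more tightly.
            out.append(a if a[1] <= b[1] else b)
        elif op == "IN_COMMON":
            # Keep a dimension only if both concepts specify it (means are not compared here).
            out.append(a if is_specified(a) and is_specified(b) else PRIOR)
        elif op == "IGNORE":
            # Reset every dimension that the second concept specifies.
            out.append(PRIOR if is_specified(b) else a)
        else:
            raise ValueError(f"unknown operator: {op}")
    return out

red = [(1.0, 0.01), PRIOR]        # dimension 0: colour
suitcase = [PRIOR, (-1.0, 0.01)]  # dimension 1: object identity
red_suitcase = recombine("AND", red, suitcase)
```

 Under these assumptions `recombine("IGNORE", red_suitcase, suitcase)` gives back `red`, as does `recombine("IN_COMMON", red_suitcase, red)`. Whether this matches what the learned module does on real latents is an open question; the point is only that the three operators are simple per-dimension rules once the latents are factorized.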
 Testing:

 They seem to have used a very limited subset of "DeepMind Lab": all of the concept or class labels could have been implemented easily, e.g. with a single-pixel detector for the wall color. Quite disappointing.

 This is marginally more interesting: the network learns to eliminate latent factors as it's exposed to examples (just like perhaps a Bayesian network).
 Similarly, the CelebA tests are meh ... not a clear improvement over the existing VAEs.
