Learning Explanatory Rules from Noisy Data
 From the dense background of inductive logic programming (ILP): given a set of statements, plus rules for transformation and substitution, generate clauses that are consistent with a set of 'background knowledge' (BK).
 Programs like Metagol can do this using the search and simplification logic built into Prolog.
 Actually kinda surprising how dense this program is: only 330 lines!
 This task can be transformed into a SAT problem via rules of logic, for which there are many fast solvers.
 The trick here (instead) is that a neural network is used to turn clauses that fit the background knowledge 'on' or 'off'.
 The BK is typically very small (a few examples), consistent with the small size of the learned networks.
 These weight matrices are represented as the outer product of composed or combined clauses, which makes the weight matrix very large!
 They then do gradient descent, passing the cross-entropy errors through nonlinearities (including the clauses themselves? I think this is how recursion is handled) to update the weights.
 Hence, SGD is used as a means of heuristic search.
 Compare this to Metagol, which is brittle to any noise in the input; unsurprisingly, due to SGD, this is much more robust.
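 A toy version of 'SGD as heuristic search' over clause weights (my own minimal sketch, not the paper's actual architecture): softmax weights mix the valuations of candidate clauses, and cross-entropy against the target atoms drives the weights toward the clause that fits the examples.

```python
import numpy as np

# Valuations of 3 candidate clauses over 4 ground atoms (1 = atom derived)
V = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],   # <- the clause consistent with the target
              [0., 1., 0., 1.]])
target = np.array([1., 1., 0., 0.])
eps = 1e-6

w = np.zeros(3)  # clause weights (logits)
for _ in range(300):
    s = np.exp(w) / np.exp(w).sum()       # softmax over clauses
    p = np.clip(s @ V, eps, 1 - eps)      # soft truth values of the atoms
    dL_dp = -(target / p - (1 - target) / (1 - p))  # cross-entropy gradient
    dL_ds = V @ dL_dp
    grad_w = s * (dL_ds - s @ dL_ds)      # softmax Jacobian applied
    w -= 0.5 * grad_w

print(np.argmax(w))  # clause 1 ends up with the highest weight
```

The search is heuristic in exactly the sense the paper means: nothing guarantees the global optimum, but the gradient consistently favors the clause whose valuation matches the examples, and noisy targets only soften (rather than break) that pull.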

 Way too many words and symbols in this paper for what it seems to be doing. It just seems to obfuscate the work (which is perfectly good). Again: Metagol is only 330 lines!

SCAN: learning hierarchical compositional concepts
 From DeepMind, first version Jul 2017 / v3 June 2018.
 Starts broad and strong:
 "The seemingly infinite diversity of the natural world arises from a relatively small set of coherent rules"
 Relative to what? What's the order of magnitude here? In personal experience, each domain involves a large pile of relevant details...
 "We conjecture that these rules give rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts"
 "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts."
 "Compositionality is at the core of such human abilities as creativity, imagination, and language-based communication."
 This addresses the limitations of deep learning, which is overly data-hungry (low sample efficiency), tends to overfit the data, and requires human supervision.
 Approach:
 Factorize the visual world with a $\beta$-VAE to learn a set of representational primitives through unsupervised exposure to visual data.
 Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set of visual primitives (features from the $\beta$-VAE) that the examples have in common.
 I.e., this is purely associative learning, with a finite one-layer association matrix.
 Test in both image-to-symbol and symbol-to-image directions. For the latter, allow irrelevant attributes to be filled in from the priors (this is important later in the paper...).
 Add in a third module, which allows learning of compositions of the features, à la set notation: AND ($\cup$), IN COMMON ($\cap$), and IGNORE ($\setminus$). This is via a low-parameter convolutional model.
 Notation:
 $q_{\phi}(z_x|x)$ is the encoder model; $\phi$ are the encoder parameters, $x$ is the visual input, and $z_x$ are the latent parameters inferred from the scene.
 $p_{\theta}(x|z_x)$ is the decoder model: $\hat{x} \sim p_{\theta}(x|z_x)$, where $\theta$ are the decoder parameters and $\hat{x}$ is the reconstructed scene.
 From this, the loss function of the $\beta$-VAE is:
 $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} [\log p_{\theta}(x|z_x)] - \beta D_{KL} (q_{\phi}(z_x|x) \| p(z_x))$ where $\beta > 1$
 That is, maximize the autoencoder fit (the expectation of the decoder log-likelihood over the encoder output, aka the pixel log-likelihood) minus the KL divergence between the encoder distribution and $p(z_x)$.
 $p(z_x) = \mathcal{N}(0, I)$, a unit Gaussian with diagonal covariance.
 $\beta$ comes from the Lagrangian solution to the constrained optimization problem:
 $\max_{\phi,\theta} \mathbb{E}_{x \sim D} [\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]]$ subject to $D_{KL}(q_{\phi}(z|x) \| p(z)) < \epsilon$, where $D$ is the domain of images etc.
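 For a diagonal-Gaussian encoder against the unit-Gaussian prior, the $D_{KL}$ term in these losses has a simple closed form; a minimal numpy sketch (the mu/log_var values are made up for illustration):

```python
import numpy as np

def kl_to_unit_gaussian(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.array([0.5, -0.3, 0.0])
log_var = np.array([0.0, -1.0, 0.2])
print(kl_to_unit_gaussian(mu, log_var))  # small positive penalty
```

The term is zero exactly when the posterior matches the prior, which is what $\beta > 1$ leans on: a heavier penalty pushes each latent dimension toward $\mathcal{N}(0, 1)$ unless it pays for itself in reconstruction.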
 They claim that this loss function tips the scale too far away from accurate reconstruction when there is sufficient visual disentangling (that is: if significant features correspond to small details in pixel space, they are likely to be ignored). Instead they adopt the approach of the denoising autoencoder ref, which uses a feature-space L2 norm instead of the pixel log-likelihood:
 $\mathbb{L}(\theta, \phi; x, z_x, \beta) = -\mathbb{E}_{q_{\phi}(z_x|x)} \| J(\hat{x}) - J(x) \|_2^2 - \beta D_{KL} (q_{\phi}(z_x|x) \| p(z_x))$ where $J : \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^N$ maps from images to high-level features.
 This $J(x)$ comes from another neural network (transfer learning) that learns features beforehand.
 It's a multilayer perceptron denoising autoencoder [Vincent 2010].
 The SCAN architecture includes an additional element, another VAE that is trained simultaneously on the labeled inputs $y$ and the latent outputs $z_x$ from the image encoder.

 In this way, they can present a description $y$ to the network, which is then recomposed into $z_y$ , that then produces an image $\hat{x}$ .
 The whole network is trained by minimizing:
 $\mathbb{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = 1^{st} - 2^{nd} - 3^{rd}$
 1st term: $\mathbb{E}_{q_{\phi_y}(z_y|y)}[\log p_{\theta_y} (y|z_y)]$, the log-likelihood of the decoded symbols given the encoded latents $z_y$.
 2nd term: $\beta D_{KL}(q_{\phi_y}(z_y|y) \| p(z_y))$, the weighted KL divergence between the encoded latents and the diagonal normal prior.
 3rd term: $\lambda D_{KL}(q_{\phi_x}(z_x|x) \| q_{\phi_y}(z_y|y))$, the weighted KL divergence between the latents from the images and the latents from the description $y$.
 They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right.
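 That asymmetry is easy to see numerically: for two univariate Gaussians the closed-form KL differs by direction, and the forward direction used in the 3rd term forces $q_{\phi_y}$ to put mass wherever $q_{\phi_x}$ does. A quick sketch (the parameter values are arbitrary):

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    # KL( N(mu1, s1^2) || N(mu2, s2^2) ), closed form
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

fwd = kl_gauss(0.0, 1.0, 1.0, 2.0)
rev = kl_gauss(1.0, 2.0, 0.0, 1.0)
print(fwd, rev)  # the two directions give different values
```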
 Final element! A convolutional recombination module, implemented as a tensor product between $z_{y1}$ and $z_{y2}$, that outputs a one-hot encoding of the set operation, which is fed to a (hardcoded?) transformation matrix.
 I don't think this is any great shakes. It could have been done with a small function; no need for a neural network.
 Trained with a loss function very similar to that of SCAN or the $\beta$-VAE.
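 As a sketch of that 'small function': if a concept is represented as a dict of specified latent dimensions (my own toy representation, not the paper's), the three set operations reduce to a few lines.

```python
def and_op(c1, c2):
    # AND: union of the specified latent dimensions (c2 wins on conflicts)
    return {**c1, **c2}

def in_common(c1, c2):
    # IN COMMON: keep only dimensions both concepts specify with the same value
    return {k: v for k, v in c1.items() if c2.get(k) == v}

def ignore(c1, c2):
    # IGNORE: drop the dimensions specified by the second concept
    return {k: v for k, v in c1.items() if k not in c2}

# Hypothetical latent specs: dim 0 = wall color, dim 3 = object size
blue_wall = {0: 0.2}
blue_small = {0: 0.2, 3: -1.0}
print(and_op(blue_wall, {3: -1.0}))      # {0: 0.2, 3: -1.0}
print(in_common(blue_small, blue_wall))  # {0: 0.2}
print(ignore(blue_small, {3: None}))     # {0: 0.2}
```

Unspecified dimensions are simply absent from the dict, matching the paper's trick of filling in irrelevant attributes from the prior at generation time.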
 Testing:

 They seem to have used a very limited subset of "DeepMind Lab": all of the concept or class labels could have been implemented easily, e.g. with a single-pixel detector for the wall color. Quite disappointing.

 This is marginally more interesting: the network learns to eliminate latent factors as it's exposed to examples (much as a Bayesian network might).
 Similarly, the CelebA tests are meh ... not a clear improvement over the existing VAEs.

PMID31942076 A distributional code for value in dopamine-based reinforcement learning
 The synopsis is staggeringly simple: dopamine neurons encode / learn to encode a distribution of reward expectations, not just the mean (aka the expected value) of the reward at a given state-action pair.
 This is almost obvious neurally: of course dopamine neurons in the striatum represent different levels of reward expectation; there is population diversity in nearly everything in neuroscience. The new interpretation is that neurons have different slopes in their susceptibility to positive and negative rewards (or rather, reward predictions), which results in different inflection points where the neurons are neutral about a reward.
 This results in more optimistic and more pessimistic neurons.
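 The optimistic/pessimistic story maps directly onto expectile-style TD updates: scale positive and negative prediction errors differently, and each 'neuron' converges to a different point of the reward distribution. A toy sketch (parameters invented, not the paper's fit):

```python
import random

random.seed(0)
rewards = [1.0, 10.0]  # reward is 1 or 10 with equal probability

def learn_value(tau, lr=0.02, steps=20000):
    # Asymmetric TD update: positive errors scaled by tau, negative by (1 - tau)
    v = 0.0
    for _ in range(steps):
        delta = random.choice(rewards) - v
        v += lr * (tau * delta if delta > 0 else (1 - tau) * delta)
    return v

pessimist = learn_value(0.1)  # converges near the 0.1-expectile (~1.9)
neutral   = learn_value(0.5)  # converges near the mean (5.5)
optimist  = learn_value(0.9)  # converges near the 0.9-expectile (~9.1)
print(pessimist, neutral, optimist)
```

The reversal point of each unit (where its prediction error is zero on average) is exactly its expectile, which is what makes the diversity of asymmetries a distributional code rather than mere noise.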

 There is already substantial evidence (circa 2017) that such a distributional representation enhances performance in DQNs (deep Q-networks); the innovation here is extending this to experiments from 2015 in which mice learned to anticipate water rewards of varying volume, or varying probability of arrival.
 The model predicts a diversity of asymmetry below and above the reversal point.

 It also predicts that the reward distribution should be decodable from neural activity ... which it is ... though it is not surprising that a bespoke decoder can find this information in the neural firing rates. (I have not examined the decoding methods in depth.)

 Still, this is a clear, well-written, well-thought-out paper; glad to see new parsimonious theories about dopamine out there.
