 m8ta
{1544} hide / / print ref: -2019 tags: HSIC information bottleneck deep learning backprop gaussian kernel date: 10-06-2021 17:23 gmt revision:5 [head] The HSIC Bottleneck: Deep learning without Back-propagation In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure. The information bottleneck was proposed by Tishby, Pereira & Bialek (of 'Spikes' fame) in 1999, and aims to increase the mutual information between the layer representation and the labels while minimizing the mutual information between the representation and the input: $\min_{P_{T_i | X}} I(X; T_i) - \beta I(T_i; Y)$ where $T_i$ is the hidden representation at layer $i$ (the layer output), $X$ is the layer input, and $Y$ are the labels. By replacing $I()$ with the HSIC, and some derivation (?), they show that $HSIC(D) = (m-1)^{-2} tr(K_X H K_Y H)$ where $D = \{(x_1,y_1), \dots, (x_m, y_m)\}$ are the samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$ -- that is, the kernel function applied to all pairs of (vectorial) input variables. $H$ is the centering matrix, $H = I - \frac{1}{m}11^T$. The kernel is simply a Gaussian, $k(x,y) = \exp(-\frac{1}{2} ||x-y||^2/\sigma^2)$. So, if the x and y are on average independent, then the inner products will be mean zero, the kernel will be mean one, and after centering the trace will be near zero. If the inner products are large within the scale set by the kernel width, then the HSIC will be large (and negative, I think). In practice they use three different widths for their kernel, and they also center the kernel matrices. But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs.
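As a concrete check on the formula above, the empirical HSIC is only a few lines of numpy (a minimal sketch: a single kernel width `sigma` is used here for illustration, where the paper uses three):

```python
import numpy as np

def gaussian_kernel(Z, sigma=1.0):
    # pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-0.5 * d2 / sigma**2)

def hsic(X, Y, sigma=1.0):
    # HSIC(D) = (m-1)^{-2} tr(K_X H K_Y H), with H the centering matrix
    m = X.shape[0]
    KX = gaussian_kernel(X, sigma)
    KY = gaussian_kernel(Y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(KX @ H @ KY @ H) / (m - 1)**2
```

On independent samples this estimator is near zero; on dependent samples (e.g. Y = X) it is much larger, which is what lets it stand in for mutual information in the per-layer objective.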
It's not unimaginable that feedback networks could be doing something like this... For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per-layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks albeit in a much less intelligible way. Robust Learning with the Hilbert-Schmidt Independence Criterion is another, later, paper using the HSIC. Their interpretation: "This loss-function encourages learning models where the distribution of the residuals between the label and the model prediction is statistically independent of the distribution of the instances themselves." Hence, given the above nomenclature, $E_X( P_{T_i | X} I(X ; T_i) ) = 0$ (I'm not totally sure about the weighting, but it might be required given the definition of the HSIC.) As I understand it, the HSIC loss is a kernelized loss between the input, output, and labels that encourages a degree of invariance to the input ('covariate shift'). This is useful, but I'm unconvinced that making the layer output independent of the input is absolutely essential (??) {1552} hide / / print ref: -2020 tags: Principe modular deep learning kernel trick MNIST CIFAR date: 10-06-2021 16:54 gmt revision:2 [head] Shiyu Duan, Shujian Yu, Jose Principe The central idea here is to re-interpret deep networks, with the nonlinearity not as the output of a layer, but rather as the input of the layer, with the regression (weights) performed on this nonlinear projection. In this sense, each re-defined layer implements the 'kernel trick': tasks (like classification) which are difficult in linear spaces become easier when projected into some sort of kernel space. The kernel allows pairwise comparisons of datapoints. E.g.
a radial basis kernel measures the radial / Gaussian distance between data points. An SVM is a kernel machine in this sense. A natural extension (one that the authors have considered) is to take non-pointwise or non-one-to-one kernel functions -- those that e.g. multiply multiple layer outputs. This is of course part of standard kernel machines. Because you are comparing projected datapoints, it's natural to take a contrastive loss on each layer to tune the weights to maximize the distance / discrimination between different classes. Hence this is semi-supervised contrastive classification, something that is quite popular these days. The last layer is tuned with cross-entropy on the labels, but only a few are required since the data is already well distributed. Demonstrated on small-ish datasets, concordant with their computational resources ... I think in general this is an important result, even if it's not wholly unique / somewhat anticipated (it's a year old at the time of writing). Modular training of neural networks is great for efficiency, parallelization, and biological implementations! Transport of weights between layers is hence non-essential. Classes still are, but I wonder if temporal continuity can solve some of these problems? (There is plenty of other effort in this area -- see also {1544}) {1543} hide / / print ref: -2019 tags: backprop neural networks deep learning coordinate descent alternating minimization date: 07-21-2021 03:07 gmt revision:1 [head] This paper is sort-of interesting: rather than back-propagating the errors, you optimize auxiliary variables -- pre-nonlinearity 'codes' -- in last-to-first layer order. The optimization is done to minimize a multimodal logistic loss function; the math is not worked out for other loss functions, but presumably this is not a fundamental limit. The loss function also includes a quadratic term on the weights. After the 'codes' are set, optimization can proceed in parallel on the weights.
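A toy sketch of this alternating scheme (my own construction, not the paper's algorithm -- the squared-error loss, learning rates, and network sizes are all illustrative): a two-layer net where the layer-1 codes $c$ are nudged toward the labels first, after which each layer's weights chase their own local target, in parallel.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5))         # inputs
Y = rng.normal(size=(64, 2))         # regression targets
W1 = rng.normal(size=(5, 8)) * 0.1   # layer 1 weights
W2 = rng.normal(size=(8, 2)) * 0.1   # layer 2 weights
relu = lambda z: np.maximum(z, 0.0)
c = X @ W1                           # codes: pre-nonlinearity layer-1 activity

def code_loss():
    return float(np.mean((relu(c) @ W2 - Y) ** 2))

loss_before = code_loss()
for _ in range(200):
    err = relu(c) @ W2 - Y
    # weight steps (parallelizable): each layer fits its own local target
    W1 -= 0.01 * X.T @ (X @ W1 - c) / len(X)   # layer 1 tracks the codes
    W2 -= 0.01 * relu(c).T @ err / len(X)      # layer 2 tracks the labels
    # code step: move the codes to reduce the output loss, weights fixed
    err = relu(c) @ W2 - Y
    c -= 0.1 * (err @ W2.T) * (c > 0)
loss_after = code_loss()
```

Both steps descend the same loss, so the loss falls without any cross-layer weight transport; the backward flow of error happens entirely through the codes.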
This is done with either straight SGD or adaptive ADAM. The weight L2 penalty is scheduled over time. This is interesting in that the weight updates can be done in parallel -- perhaps more efficient -- but you are still propagating errors backward, albeit via optimizing 'codes'. Given the vast infrastructure devoted to auto-diff + backprop, I can't see this being adopted broadly. That said, the idea of alternating minimization (which is used e.g. for EM clustering) is powerful, and this paper does describe (though I didn't read it) how there are guarantees on the convergence of the alternating minimization. Likewise, the authors show how to improve the performance of the online / minibatch algorithm by keeping around memory variables, in the form of covariance matrices. {1535} hide / / print ref: -2019 tags: deep double descent lottery ticket date: 02-23-2021 18:47 gmt revision:2 [head] Reconciling modern machine-learning practice and the classical bias-variance trade-off A formal publication of the effect famously discovered at OpenAI & publicized on their blog. Goes into some detail on Fourier features & runs experiments to verify the OpenAI findings. The result stands. An interesting avenue of research is using genetic algorithms to perform the search over neural network parameters (instead of backprop) in reinforcement-learning tasks. Ben Phillips has a blog post on some of the recent results, which show that it does work for certain 'hard' problems in RL. Of course, this is the dual of the 'lottery ticket' hypothesis and the deep double descent, above; large networks are likely to have solutions 'close enough' to solve a given problem. That said, genetic algorithms don't necessarily perform gradient descent to tweak the weights for optimal behavior once they are within the right region of RL behavior. See {1530} for more discussion on this topic, as well as {1525} for a more complete literature survey.
{1534} hide / / print ref: -2020 tags: current opinion in neurobiology Kriegeskorte review article deep learning neural nets circles date: 02-23-2021 17:40 gmt revision:2 [head] Going in circles is the way forward: the role of recurrence in visual inference I think the best part of this article is the references -- a nicely complete listing of, well, the current opinion in Neurobiology! (Note that this issue is edited by our own Karel Svoboda, hence there are a good number of Janelians in the author list..) The gestalt of the review is that deep neural networks need to be recurrent, not purely feed-forward. This results in savings in overall network size, and an increase in the achievable computational complexity, perhaps via the incorporation of priors and temporal-spatial information. All this again makes perfect sense and matches my sense of prevailing opinion. Of course, we are left wanting more: all this recurrence ought to be structured in some way. To me, a rather naive way of thinking about it is that feed-forward layers cause weak activations, which are 'amplified' or 'selected for' in downstream neurons. These neurons proximally code for 'causes' or local reasons, based on the supported hypothesis that the brain has a good temporal-spatial model of the visuo-motor world. The causes then can either explain away the visual input, leading to balanced E-I, or fail to explain it, in which case the excess activity is either rectified by engaging more circuits or by engaging synaptic plasticity. A critical part of this hypothesis is some degree of binding / disentanglement / spatio-temporal re-assignment. While not all models of computation require registers / variables -- RNNs are Turing-complete, e.g. -- I remain stuck on the idea that, to explain phenomenological experience and practical cognition, the brain must have some means of 'binding'.
A reasonable place to look is the apical tuft dendrites, which are capable of storing temporary state (calcium spikes, NMDA spikes), undergo rapid synaptic plasticity, and are so dense that they can reasonably store the outer-product space of binding. There is mounting evidence for apical tufts working independently / in parallel from investigations of high-gamma in ECoG: PMID-32851172 Dissociation of broadband high-frequency activity and neuronal firing in the neocortex. "High gamma" shows little correlation with MUA when you differentiate early-deep and late-superficial responses, "consistent with the view it reflects dendritic processing separable from local neuronal firing" {1530} hide / / print ref: -2017 tags: deep neuroevolution jeff clune Uber genetic algorithms date: 02-18-2021 18:27 gmt revision:1 [head] In this paper, they used a (fairly generic) genetic algorithm to tune the weights of a relatively large (4M parameter) convolutional neural net to play 13 Atari games. The GA used truncation selection, a population of ~1k individuals, no crossover, and Gaussian mutation. To speed up and streamline this algo, they encoded the weights not directly but as an initialization seed to the RNG (log2 of the number of parameters, approximately), plus seeds to generate the per-generation mutations (~28 bits each). This substantially decreased the required storage space and communication costs when running the GA in parallel on their cluster; they only had to transmit the RNG seed sequence. Quite surprisingly, the GA was good at typically 'hard' games like Frostbite and Skiing, whereas it fared poorly on games like Atlantis (which is a fixed-gun shooter game) and Assault. Performance was compared to deep Q-networks (DQN), evolutionary search (which used stochastic gradient approximation), Asynchronous Advantage Actor-Critic (A3C), and random search (RS). They surmise that some games were thought to be hard, but are actually fairly easy, albeit with many local minima.
This is why search around the origin (near the initialization of the networks, which was via the Xavier method) is sufficient to solve the tasks. Also noted: frequently the GA would find individuals with good performance in ~10 generations, further supporting the point above. The GA provides very consistent performance across the entirety of a trial, which, they suggest, may offer a cleaner signal to selection as to the quality of each of the individuals (debatable!). Of course, for some tasks the GA fails woefully; it was not able to quickly learn to control a humanoid robot, which involves mapping a ~370-dimensional vector into ~17 joint torques. Evolutionary search was able to perform this task, which is not surprising as the gradient here should be smooth. The result is indeed surprising, but it also feels lazy -- the total effort or information that they put into writing the actual algorithm is small; as mentioned in the introduction, this is a case of old algorithms with modern levels of compute. Analogously, compare Go-Explore, also by Uber AI labs, vs Agent57 by DeepMind; the Agent57 paper blithely dismisses the otherwise breathless Go-Explore result as feature engineering and unrealistic free backtracking / game-resetting (which is true..)
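The seed-chain encoding is simple to sketch (a minimal reconstruction of the idea as described above; `N_PARAMS`, `SIGMA`, and the 28-bit seed range are stand-ins, not the paper's values):

```python
import numpy as np

N_PARAMS = 1000   # stand-in for the real net's ~4M parameters
SIGMA = 0.02      # mutation standard deviation (illustrative)

def decode(genome):
    """Rebuild the full weight vector from a list of seeds:
    [init_seed, mutation_seed_1, mutation_seed_2, ...]."""
    theta = np.random.default_rng(genome[0]).normal(size=N_PARAMS)
    for seed in genome[1:]:
        theta += SIGMA * np.random.default_rng(seed).normal(size=N_PARAMS)
    return theta

def mutate(genome, rng):
    # an offspring is just the parent genome plus one new ~28-bit seed
    return genome + [int(rng.integers(2**28))]
```

Only the (tiny) seed list ever crosses the network; every worker can deterministically reconstruct any individual's full weight vector by replaying its seeds.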
It's strange that they did not incorporate crossover aka recombination, as David MacKay clearly shows that recombination allows for much higher mutation rates and much better transmission of information through a population (chapter 'Why have Sex?'). They also, perhaps more reasonably, omit developmental encoding, where network weights are tied or controlled through development, again in an analogy to biology. A better solution, as they point out, would be some sort of hybrid GA / ES / A3C system which used gradient-based tuning, random stochastic gradient-based exploration, and straight genetic optimization, possibly all in parallel, with global selection as the umbrella. They mention this, but to my current knowledge it has not been done. {1527} hide / / print ref: -0 tags: inductive logic programming deepmind formal propositions prolog date: 11-21-2020 04:07 gmt revision:0 [head] From a dense background of inductive logic programming (ILP): given a set of statements, and rules for transformation and substitution, generate clauses that satisfy a set of 'background knowledge'. Programs like Metagol can do this using the search and simplification logic built into Prolog. Actually kinda surprising how very dense this program is -- only 330 lines! This task can be transformed into a SAT problem via the rules of logic, for which there are many fast solvers. The trick here (instead) is that a neural network is used to turn 'on' or 'off' clauses that fit the background knowledge. The BK is typically very small, a few examples, consistent with the small size of the learned networks. These weight matrices are represented as the outer product of composed or combined clauses, which makes the weight matrix very large! They then do gradient descent, while passing the cross-entropy errors through nonlinearities (including clauses themselves? I think this is how recursion is handled) to update the weights. Hence, SGD is used as a means of heuristic search.
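The flavor of 'SGD as clause search' can be caricatured in a few lines (my toy construction, far simpler than the paper's system: candidate clauses are reduced to precomputed truth-value columns over the examples, and gradient descent on a softmax over clauses finds the one consistent with the BK):

```python
import numpy as np

# truth values of 3 candidate clauses on 4 examples; column 1 matches the BK
C = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 1., 0.]])
y = np.array([0., 1., 1., 1.])   # target truth values (the BK examples)
w = np.zeros(3)                  # clause logits

for _ in range(500):
    p = np.exp(w) / np.sum(np.exp(w))       # soft clause selection
    r = C @ p - y                           # residual of the soft interpretation
    w -= 0.5 * p * (C.T @ r - r @ (C @ p))  # gradient of 0.5*||r||^2 w.r.t. w
loss = 0.5 * float(r @ r)
```

Gradient descent sharpens the softmax onto the consistent clause; unlike a discrete Prolog search, a noisy label just leaves a small residual rather than breaking the search entirely.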
Compare this to Metagol, which is brittle to any noise in the input; unsurprisingly, due to SGD, this is much more robust. Way too many words and symbols in this paper for what it seems to be doing. It just seems to obfuscate the work (which is perfectly good). Again: Metagol is only 330 lines! {1510} hide / / print ref: -2017 tags: google deepmind compositional variational autoencoder date: 04-08-2020 01:16 gmt revision:7 [head] From DeepMind, first version Jul 2017 / v3 June 2018. Starts broad and strong: "The seemingly infinite diversity of the natural world [arises] from a relatively small set of coherent rules" Relative to what? What's the order of magnitude here? In personal experience, each domain involves a large pile of relevant details.. "We conjecture that these rules give rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts" "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts." See Human-level concept learning through probabilistic program induction (a much better paper, which more concretely introduces compositionality in computation..) "Compositionality is at the core of such human abilities as creativity, imagination, and language-based communication." This addresses the limitations of deep learning, which is overly data-hungry (low sample efficiency), tends to overfit the data, and requires human supervision. Approach: Factorize the visual world with a $\beta$ -VAE to learn a set of representational primitives through unsupervised exposure to visual data. Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set of visual primitives (features from the $\beta$ -VAE) that the examples have in common. E.g. this is purely associative learning, with a finite one-layer association matrix. Tested in both the image-to-symbol and symbol-to-image directions.
For the latter, allow irrelevant attributes to be filled in from the priors (this is important later in the paper..) Add in a third module, which allows learning of compositions of the features, a la set notation: AND ( $\cup$ ), IN-COMMON ( $\cap$ ) & IGNORE ( $\setminus$ or '-'). This is via a low-parameter convolutional model. Notation: $q_{\phi}(z_x|x)$ is the encoder model; $\phi$ are the encoder parameters, $x$ is the visual input, and $z_x$ are the latent parameters inferred from the scene. $p_{\theta}(x|z_x)$ is the decoder model: $x \propto p_{\theta}(x|z_x)$ , $\theta$ are the decoder parameters, and $x$ is now the reconstructed scene. From this, the loss function of the $\beta$ -VAE is: $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} [log p_{\theta}(x|z_x)] - \beta D_{KL} (q_{\phi}(z_x|x)|| p(z_x))$ where $\beta > 1$ . That is, maximize the auto-encoder fit (the expectation of the decoder over the encoder output -- aka the pixel log-likelihood) minus the KL divergence between the encoder distribution and the prior $p(z_x)$ , where $p(z) = \mathcal{N}(0, I)$ -- a diagonal (isotropic) normal. $\beta$ comes from the Lagrangian solution to the constrained optimization problem: $\max_{\phi,\theta} \mathbb{E}_{x \sim D} [\mathbb{E}_{q_{\phi}(z|x)}[log p_{\theta}(x|z)]]$ subject to $D_{KL}(q_{\phi}(z|x)||p(z)) < \epsilon$ where $D$ is the domain of images etc. They claim that this loss function tips the scale too far away from accurate reconstruction in exchange for visual disentangling (that is: if significant features correspond to small details in pixel space, they are likely to be ignored); instead they adopt the approach of the denoising auto-encoder ref, which uses the feature L2 norm instead of the pixel log-likelihood: $\mathbb{L}(\theta, \phi; x, z_x, \beta) = -\mathbb{E}_{q_{\phi}(z_x|x)}||J(\hat{x}) - J(x)||_2^2 - \beta D_{KL} (q_{\phi}(z_x|x)|| p(z_x))$ where $J : \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^N$ maps from images to high-level features.
This $J(x)$ comes from another neural network (transfer learning) that learns features beforehand -- a multilayer perceptron denoising autoencoder [Vincent 2010]. The SCAN architecture includes an additional element: another VAE, trained simultaneously on the labeled inputs $y$ and the latent outputs $z_x$ from the encoder given $x$ . In this way, they can present a description $y$ to the network, which is then recomposed into $z_y$ , which then produces an image $\hat{x}$ . The whole network is trained by minimizing: $\mathbb{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = 1^{st} - 2^{nd} - 3^{rd}$ 1st term: $\mathbb{E}_{q_{\phi_y}(z_y|y)}[log p_{\theta_y} (y|z_y)]$ , the log-likelihood of the decoded symbols given the encoded latents $z_y$ . 2nd term: $\beta D_{KL}(q_{\phi_y}(z_y|y) || p(z_y))$ , the weighted KL divergence between the encoded latents and the diagonal normal prior. 3rd term: $\lambda D_{KL}(q_{\phi_x}(z_x|x) || q_{\phi_y}(z_y|y))$ , the weighted KL divergence between the latents from the images and the latents from the description $y$ . They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right. Final element! A convolutional recombination element, implemented as a tensor product between $z_{y1}$ and $z_{y2}$ that outputs a one-hot encoding of the set operation, which is fed to a (hardcoded?) transformation matrix. I don't think this is great shakes. Could have done this with a small function; no need for a neural network. Trained with a very similar loss function as SCAN or the $\beta$ -VAE. Testing: They seem to have used a very limited subset of "DeepMind Lab" -- all of the concept or class labels could have been implemented easily, e.g. a single-pixel detector for the wall color. Quite disappointing. This is marginally more interesting: the network learns to eliminate latent factors as it's exposed to examples (just like, perhaps, a Bayesian network.) Similarly, the CelebA tests are meh ...
not a clear improvement over the existing VAEs. {1500} hide / / print ref: -0 tags: reinforcement learning distribution DQN Deepmind dopamine date: 03-30-2020 02:14 gmt revision:5 [head] PMID-31942076 A distributional code for value in dopamine-based reinforcement learning The synopsis is staggeringly simple: dopamine neurons encode / learn to encode a distribution of reward expectations, not just the mean (aka the expected value) of the reward at a given state-action pair. This is almost obvious neurally -- of course dopamine neurons in the striatum represent different levels of reward expectation; there is population diversity in nearly everything in neuroscience. The new interpretation is that neurons have different slopes for their susceptibility to positive and negative rewards (or rather, reward predictions), which results in different inflection points where the neurons are neutral about a reward. This constitutes more optimistic and more pessimistic neurons. There is already substantial evidence that such a distributional representation enhances performance in DQN (deep Q-networks), from circa 2017; the innovation here is that it has been extended to experiments from 2015 where mice learned to anticipate water rewards with varying volume, or varying probability of arrival. Perfect: in these experiments there was a multimodal reward distribution. It is also an instance of the more general theme of asymmetric regression, which has found utility in deep networks (at the cost of more parameters). See Rainbow: Combining Improvements in Deep Reinforcement Learning. The model predicts a diversity of asymmetry below and above the reversal point. It also predicts that the distribution of reward responses should be decodable from the neural activity ... which it is ... but it is not surprising that a bespoke decoder can find this information in the neural firing rates.
(I have not examined the decoding methods in depth.) Still, this is a clear, well-written, well-thought-out paper; glad to see new parsimonious theories about dopamine out there. {1505} hide / / print ref: -2016 tags: locality sensitive hash deep learning regularization date: 03-30-2020 02:07 gmt revision:5 [head] Central idea: replace dropout, adaptive dropout, or winner-take-all with a fast (sublinear time) hash-based selection of active nodes, based on approximate MIPS (maximum inner product search) using asymmetric locality-sensitive hashing. This avoids a lot of the expensive inner-product multiply-accumulate work & energy associated with nodes that will either be completely off due to the ReLU or other nonlinearity -- or just not important for the algorithm + current input. The result shows that you don't need very many neurons active in a given layer for successful training. Cf. adaptive dropout, which adaptively chooses the nodes based on their activations: a few nodes are sampled from the network probabilistically, based on the node activations for the current input. Adaptive dropout demonstrates better performance than vanilla dropout; it is possible to adaptively drop significantly more nodes while retaining superior performance. WTA is an extreme form of adaptive dropout that uses mini-batch statistics to enforce a sparsity constraint. {1507} Winner take all autoencoders Our approach uses the insight that selecting a very sparse set of hidden nodes with the highest activations can be reformulated as dynamic approximate query processing, solvable with LSH. LSH can be sub-linear time; normal processing involves the inner product. LSH maps similar vectors into the same bucket with high probability; that is, it maps vectors to integers (bucket numbers). A similar approach: HashedNets, which aimed to decrease the number of parameters in a network by using a universal random hash function to tie weights.
Compressing neural networks with the Hashing trick: "HashedNets uses a low-cost hash function to randomly group connection weights into hash buckets, and all connections within the same hash bucket share a single parameter value." Ref shows how asymmetric hash functions allow LSH to be converted to a sub-linear time algorithm for maximum inner product search (MIPS). Used multi-probe LSH: rather than having a large number of hash tables (L), which increases hash time and memory use, they probe close-by buckets in the hash tables. That is, they probe the bucket at B_j(Q) and those for slightly perturbed queries Q. See ref. See reference for theory... Following ref, use K randomized hash functions to generate the K data bits per vector; each bit is the sign of the asymmetric random projection. Buckets contain a pointer to the node (neuron); only active buckets are kept around. The K hash functions serve to increase the precision of the fingerprint -- found nodes are more likely to be active. There are L hash tables for each hidden layer; these are used to increase the probability of finding useful / active nodes despite the randomness of the hash functions. The hash is asymmetric in the sense that the query and the collection data are hashed independently. In every layer during SGD, compute K x L hashes of the input, probe about 10 L buckets, and take their union. Experiments: K = 6 and L = 5. See ref, where the authors show around a 500x reduction in computation for image search following different algorithmic and systems choices. Capsule: a camera-based positioning system using learning {1506} Uses relatively small test data sets -- MNIST 8M, NORB, Convex, Rectangles -- each resized to have small-ish input vectors. Really want more analysis of what exactly is going on here -- what happens when you change the hashing function, for example? How much does the training depend on a suitable ROC or precision/recall of the activation selection?
For example, they could have calculated the actual activations & WTA selection, and compared them to the results from the hash function; how correlated are they? {1482} hide / / print ref: -2019 tags: meta learning feature reuse deepmind date: 10-06-2019 04:14 gmt revision:1 [head] It's feature re-use! They show this by freezing the weights of a 5-layer convolutional network when training on Mini-ImageNet, either 5-way 1-shot or 5-way 5-shot. From this they derive ANIL ('almost no inner loop'), where only the last network layer is updated in task-specific training. They show that ANIL works for basic RL learning tasks. This means that, roughly, the network does not benefit much from joint encoding -- encoding both the task at hand and the feature set. Features can be learned independently from the task (at least these tasks), with little loss. {1441} hide / / print ref: -2018 tags: biologically inspired deep learning feedback alignment direct difference target propagation date: 03-15-2019 05:51 gmt revision:5 [head] Sergey Bartunov, Adam Santoro, Blake A. Richards, Luke Marris, Geoffrey E. Hinton, Timothy Lillicrap As is known, many algorithms work well on MNIST, but fail on more complicated tasks, like CIFAR and ImageNet. In their experiments, backprop still fares better than any of the biologically inspired / biologically plausible learning rules. This includes: Feedback alignment {1432} {1423} Vanilla target propagation. Problem: with convergent networks, layer inverses (top-down) will map all items of the same class to one target vector in each layer, which is very limiting. Hence this algorithm was not directly investigated. Difference target propagation (2015) uses the per-layer target $\hat{h}_l = g(\hat{h}_{l+1}; \lambda_{l+1}) + [h_l - g(h_{l+1};\lambda_{l+1})]$ or: $\hat{h}_l = h_l + g(\hat{h}_{l+1}; \lambda_{l+1}) - g(h_{l+1};\lambda_{l+1})$ where $\lambda_{l}$ are the parameters of the inverse model and $g()$ is the sum and nonlinearity.
That is, the target is modified, a la the delta rule, by the difference between the inverse-propagated higher-layer target and the inverse-propagated higher-layer activity. Why? $h_{l}$ should approach $\hat{h}_{l}$ as $h_{l+1}$ approaches $\hat{h}_{l+1}$ . Otherwise, the parameters in lower layers continue to be updated even when low loss is reached in the upper layers (from the original paper). The penultimate-to-output layer weights are trained via backprop to prevent the template impoverishment problem noted above. Simplified difference target propagation They substitute a biologically plausible learning rule for the penultimate layer, $\hat{h}_{L-1} = h_{L-1} + g(\hat{h}_L;\lambda_L) - g(h_L;\lambda_L)$ where there are $L$ layers. It's the same rule as the other layers, and hence subject to the impoverishment problem with low-entropy labels. Auxiliary output simplified difference target propagation Adds a vector $z$ to the last-layer activation, which carries information about the input vector; $z$ is just a set of random features from the activation $h_{L-1}$ . They used both fully-connected and locally-connected (e.g. convolution without weight sharing) MLPs. It's not so great: target propagation seems like a weak learner, worse than feedback alignment; not only is the feedback limited, but it does not take advantage of the statistics of the input. Hence, some of these schemes may work better when combined with unsupervised learning rules. Still, in the original paper they use difference target propagation with autoencoders, and get reasonable stroke features.. Their general result -- that networks and learning rules need to be tested on more difficult tasks -- rings true, and might well be the main point of this otherwise meh paper. {1439} hide / / print ref: -2006 tags: hinton contrastive divergence deep belief nets date: 02-20-2019 02:38 gmt revision:0 [head] PMID-16764513 A fast learning algorithm for deep belief nets. Hinton GE, Osindero S, Teh YW. Very highly cited contrastive divergence paper.
Back in 2006 it yielded state-of-the-art MNIST performance. And, being CD, it can be used in an unsupervised mode. {1419} hide / / print ref: -0 tags: diffraction terahertz 3d print ucla deep learning optical neural networks date: 02-13-2019 23:16 gmt revision:1 [head] Pretty clever: use 3D-printed plastic as diffractive media in a 0.4 THz all-optical, all-interference (some attenuation) linear convolutional multi-layer 'neural network'. In the arXiv publication there are few details on how they calculated or optimized the given diffractive layers. The absence of nonlinearity will limit things greatly. The actual observed performance (where they had to print out the handwritten digits) is rather poor, ~60%. {1174} hide / / print ref: -0 tags: Hinton google tech talk dropout deep neural networks Boltzmann date: 02-12-2019 08:03 gmt revision:2 [head] Hinton believes in the power of crowds -- he thinks that the brain fits many, many different models to the data, then selects afterward. Random forests, as used in Predator, are an example of this: they average many simple-to-fit and simple-to-run decision trees (this is apparently what Kinect does). The talk focuses on dropout, a clever new form of model averaging where only half of the units in the hidden layers are trained for a given example. He is inspired by biological evolution, where sexual reproduction often spontaneously adds or removes genes, hence individual genes or small linked groups of genes must be self-sufficient. This equates to a 'rugged individualism' of units. Likewise, dropout forces neurons to be robust to the loss of co-workers. This is also great for parallelization: each unit or sub-network can be trained independently, on its own core, with little need for communication! Later, the units can be combined via genetic algorithms then re-trained. Hinton then observes that sending a real value p (the output of the logistic function) with probability 0.5 is the same as sending 0.5 with probability p.
Hence, it makes sense to try pure binary neurons, like biological neurons in the brain. Indeed, if you replace backpropagation with single-bit propagation, the resulting neural network is trained more slowly and needs to be bigger, but it generalizes better. Neurons (allegedly) do something very similar to this via Poisson spiking. Hinton claims this is the right thing to do (rather than sending real numbers via precise spike timing) if you want to robustly fit models to data. Sending stochastic spikes is a very good way to average over the large number of models fit to incoming data. Yes, but this really explains little in neuroscience... Paper referred to in the intro: Livnat, Papadimitriou and Feldman, PMID-19073912 and later by the same authors PMID-20080594 A mixability theory for the role of sex in evolution. -- "We define a measure that represents the ability of alleles to perform well across different combinations and, using numerical iterations within a classical population-genetic framework, show that selection in the presence of sex favors this ability in a highly robust manner" Plus David MacKay's concise illustration of why you need sex, pg 269, __Information theory, inference, and learning algorithms__: with rather simple assumptions, asexual reproduction yields 1 bit per generation, whereas sexual reproduction yields $\sqrt G$ , where G is the genome size.

{1422} hide / / print ref: -0 tags: lillicrap segregated dendrites deep learning backprop date: 01-31-2019 19:24 gmt revision:2 [head] PMID-29205151 Towards deep learning with segregated dendrites https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5716677/ Much emphasis on the problem of credit assignment in biological neural networks. That is: given complex behavior, how do upstream neurons change to improve the task of downstream neurons? Or: given downstream neurons, how do upstream neurons receive 'credit' for informing behavior?
I find this a very limiting framework, and it is one of my chief beefs with the work. Spatiotemporal Bayesian structure seems like a much better axis (or axes) to cast function against. Or, it could be segregation into 'signal' and 'error', or 'figure/ground', based on hierarchical spatio-temporal statistical properties that matters -- with proper integration of non-stochastic spike timing + neoSTDP. This still requires some solution of the credit-assignment problem, I know, I know.

They outline a spiking neuron model with zero, one, or two hidden layers, and segregated apical (feedback) and basal (feedforward) dendrites, as per a layer 5 pyramidal neuron. The apical dendrites have plateau potentials, which are stimulated through (random) feedback weights from the output neurons. Output neurons are forced to one-hot activation at maximum firing rate during training. In order to assign credit, feedforward information must be integrated separately from any feedback signals used to calculate error for synaptic updates (the error is indicated here with δ). (B) Illustration of the segregated dendrites proposal: rather than using a separate pathway to calculate error based on feedback, segregated dendritic compartments could receive feedback and calculate the error signals locally. Uses the MNIST database, naturally. Poisson-spiking input neurons, 784 of them, again natch. They derive local loss-function learning rules to make the plateau potential (from the feedback weights) match the feedforward potential. This encourages the hidden layer -> output layer to approximate the inverse of the random feedback weight network -- which it does! (At least, the Jacobians are inverses of each other.) The matching is performed in two phases -- feedforward and feedback. This itself is not biologically implausible, just unlikely. Achieved moderate performance on MNIST, ~4% error, which improved with 2 hidden layers.
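The random-feedback idea this model builds on (feedback alignment) can be shown in a much-simplified rate-based sketch, without the dendritic compartments or spiking. Sizes, task, and learning rate here are illustrative, not from the paper -- the point is only that a fixed random matrix B, in place of the transpose of the forward weights, suffices to deliver usable credit to the hidden layer:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy task: learn a random linear map y = M @ x with one hidden layer
n_in, n_hid, n_out = 8, 16, 4
W1 = 0.1 * rng.standard_normal((n_hid, n_in))
W2 = 0.1 * rng.standard_normal((n_out, n_hid))
B = 0.5 * rng.standard_normal((n_hid, n_out))  # fixed random feedback, never learned
M = rng.standard_normal((n_out, n_in))         # target mapping

lr, losses = 0.01, []
for _ in range(5000):
    x = rng.standard_normal(n_in)
    h = np.tanh(W1 @ x)
    y = W2 @ h
    e = M @ x - y                     # output error
    losses.append(0.5 * e @ e)
    # backprop would use W2.T @ e here; random B replaces it
    dh = (B @ e) * (1.0 - h**2)
    W2 += lr * np.outer(e, h)
    W1 += lr * np.outer(dh, x)
```

The loss still falls because the forward weights drift into alignment with B over training -- the same phenomenon the paper's "hidden layer approximates the inverse of the feedback network" observation points at.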
Very good, interesting scholarship on the relevant latest findings ''in vivo''. While the model seems workable, though ad-hoc or just-so, the scholarship points to something better: the use of multiple neuron subtypes to accomplish different elements (variables) in the random-feedback credit assignment algorithm. These small models can be tuned to do this somewhat simple task through enough fiddling & manual (e.g. in the algorithmic space, not weight space) backpropagation of errors. They suggest that the early phases of learning may entail learning the feedback weights -- fascinating. ''Things are definitely moving forward''.

{1412} hide / / print ref: -0 tags: deeplabcut markerless tracking DCN transfer learning date: 10-03-2018 23:56 gmt revision:0 [head] Human-level tracking with as few as 200 labeled frames. No dynamics -- could be even better with a Kalman filter. Uses a Google-trained DCN, 50 or 101 layers deep. The network has a distinct read-out layer per feature to localize the probability of a body part at a pixel location. Uses the DeeperCut network architecture / algorithm for pose estimation -- or, rather, its ResNet feature detectors (Deep Residual Neural Networks). These deep features were trained on ImageNet. Trained examples both with only the readout layers (the rest fixed per ResNet) and end-to-end; the latter performs better, unsurprisingly.

{1408} hide / / print ref: -2018 tags: machine learning manifold deep neural net geometry regularization date: 08-29-2018 14:30 gmt revision:0 [head] Synopsis of the math: fit a manifold formed from the concatenated input ''and'' output variables, and use this to set the loss of (hence, train) a deep convolutional neural network. The manifold is fit via the point integral method. This requires both SGD and variational steps -- alternate between fitting the parameters and fitting the manifold. Uses a standard deep neural network. They measure the dimensionality of this manifold to regularize the network.
Using an 'elegant trick', whatever that means. Still, the results, in terms of error, seem not significantly better than previous work (compared to weight decay, which is weak sauce, and dropout). That said, the results in terms of feature projection, figures 1 and 2, ''do'' look clearly better. Of course, they apply the regularizer to the same image recognition / classification problems (MNIST), and it might well be better adapted to something else. Not a completely thorough analysis, perhaps due to space and deadlines.

{1333} hide / / print ref: -0 tags: deep reinforcement learning date: 04-12-2016 17:19 gmt revision:6 [head] In general, experience replay can reduce the amount of experience required to learn, and replace it with more computation and more memory -- which are often cheaper resources than the RL agent's interactions with its environment. Transitions (between states) may be more or less surprising (does the system in question have a model of the environment? It does have a model of the state & action expected reward, as it's Q-learning), redundant, or task-relevant. Some sundry neuroscience links: sequences associated with rewards appear to be replayed more frequently (Atherton et al., 2015; Ólafsdóttir et al., 2015; Foster & Wilson, 2006). Experiences with high-magnitude TD error also appear to be replayed more often (Singer & Frank, 2009 PMID-20064396 ; McNamara et al., 2014). They pose a useful example where the task is to learn (effectively) a random series of bits -- 'Blind Cliffwalk'. By choosing the replayed experiences properly (via an oracle), you can get an exponential speedup in learning. Prioritized replay introduces bias because it changes [the sampled state-action] distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to (even if the policy and state distribution are fixed). We can correct this bias by using importance-sampling (IS) weights.
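A minimal sketch of the proportional prioritization variant with IS correction. The buffer contents and the values of alpha, beta, and eps here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# replay buffer of |TD error| magnitudes (hypothetical values)
td_errors = np.array([0.1, 0.1, 2.0, 0.1, 0.5])
alpha, beta = 0.6, 0.4  # prioritization / IS-correction exponents
eps = 1e-3              # keeps zero-error transitions sampleable

# proportional prioritization: P(i) ~ (|delta_i| + eps)^alpha
prio = (np.abs(td_errors) + eps) ** alpha
P = prio / prio.sum()

# sample a minibatch of indices according to P
idx = rng.choice(len(P), size=64, p=P)

# importance-sampling weights undo the sampling bias:
# w_i = (1 / (N * P(i)))^beta, normalized by the max for stability
N = len(P)
w = (1.0 / (N * P[idx])) ** beta
w /= w.max()
```

The per-sample TD updates are then scaled by `w`; annealing beta toward 1 over training makes the correction exact as the policy converges.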
These weights are the inverse of the priority weights, but they don't matter so much at the beginning, when things are more stochastic; the controlling exponent is annealed over the course of training. There are two ways of selecting (weighting) the priorities: direct, proportional to the TD error encountered when visiting a sequence; and ranked, where errors and sequences are stored in a data structure ordered by error and sampled $\propto 1 / rank$ . Somewhat illuminating is how deep TD or Q-learning is unable to even scratch the surface of Tetris or Montezuma's Revenge.

{1269} hide / / print ref: -0 tags: hinton convolutional deep networks image recognition 2012 date: 01-11-2014 20:14 gmt revision:0 [head] ImageNet Classification with Deep Convolutional Neural Networks