 m8ta
{1564} hide / / print ref: -2008 tags: t-SNE dimensionality reduction embedding Hinton date: 01-25-2022 20:39 gmt revision:2 [head]

"Visualizing data using t-SNE" Laurens van der Maaten, Geoffrey Hinton.

SNE: stochastic neighbor embedding, Hinton 2002.
Idea: model the data's conditional pairwise distribution as a Gaussian, with one variance per data point, $p(x_i | x_j)$ .
In the mapped (low-dimensional) data, this pairwise distribution is modeled as a fixed-variance Gaussian too, $q(y_i | y_j)$ .
The goal is to minimize the Kullback-Leibler divergence $\Sigma_i KL(p_i || q_i)$ (summed over all data points).
The per-data-point variance is found via binary search to match a user-specified perplexity. This amounts to setting a number of nearest neighbors; somewhere between 5 and 50 works ok.
The cost function is minimized via gradient descent, starting from a random distribution of points $y_i$ , with plenty of momentum to speed up convergence, and noise to effect simulated annealing.
The gradient is remarkably simple: $\frac{\delta C}{\delta y_i} = 2 \Sigma_j(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$

t-SNE differs from SNE (above) in that it addresses the difficulty of optimizing the cost function, and crowding:
It uses a simplified symmetric cost function (a symmetrized joint probability, rather than conditional probabilities), with simpler gradients.
It uses the Student's t-distribution in the low-dimensional map $q$ to reduce the crowding problem. The crowding problem roughly results from the fact that, in high-dimensional spaces, the volume of a local neighborhood scales as $r^m$ , whereas in 2D it's just $r^2$ . Hence there is a cost incentive to push all the points together in the map -- points are volumetrically closer together in high dimensions than they can be in 2D. This can be alleviated by using a one-DOF Student distribution, which is the same as a Cauchy distribution, to model the probabilities in map space.
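The SNE gradient above is easy to check numerically. A minimal numpy sketch (my own names), assuming P is the matrix of conditionals $p_{j|i}$ with zero diagonal and rows summing to 1 (the perplexity binary search that produces P is omitted), and using the paper's $q_{j|i} \propto exp(-||y_i - y_j||^2)$ :

```python
import numpy as np

def sne_q(Y):
    """Low-dimensional conditionals q_{j|i}: fixed-variance Gaussians on map points Y."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # squared distances
    E = np.exp(-D)
    np.fill_diagonal(E, 0.0)                 # q_{i|i} = 0 by convention
    return E / E.sum(axis=1, keepdims=True)  # row i holds q_{j|i}

def sne_cost(P, Y):
    """C = sum_i KL(p_i || q_i), with P[i, j] = p_{j|i} and zero diagonal."""
    Q = sne_q(Y)
    mask = ~np.eye(P.shape[0], dtype=bool)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def sne_grad(P, Y):
    """dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)."""
    Q = sne_q(Y)
    M = (P - Q) + (P - Q).T  # symmetrized mismatch matrix
    # sum_j M[i,j] * (y_i - y_j), vectorized:
    return 2.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)
```

A finite-difference check of `sne_grad` against `sne_cost` confirms the gradient formula is exact for this parameterization of q.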
Smart -- they plot the topology of the gradients to gain insight into modeling / convergence behavior. Simulated annealing isn't needed, due to the balanced attractive and repulsive effects (see figure). They enhance the algorithm further by keeping the embedding compact at the beginning, so that clusters can move through each other.

Look up: d-bits parity task by Bengio 2007

{1439} hide / / print ref: -2006 tags: hinton contrastive divergence deep belief nets date: 02-20-2019 02:38 gmt revision:0 [head]

PMID-16764513 A fast learning algorithm for deep belief nets. Hinton GE1, Osindero S, Teh YW.

Very highly cited contrastive divergence paper. Back in 2006 it yielded state-of-the-art MNIST performance. And, being CD, it can be used in an unsupervised mode.

{1174} hide / / print ref: -0 tags: Hinton google tech talk dropout deep neural networks Boltzmann date: 02-12-2019 08:03 gmt revision:2 [head]

Hinton believes in the power of crowds -- he thinks that the brain fits many, many different models to the data, then selects afterward. Random forests, as used in Predator, are an example of this: they average many simple-to-fit and simple-to-run decision trees. (This is apparently what Kinect does.)

The talk focuses on dropout, a clever new form of model averaging where only half of the units in the hidden layers are trained for a given example. He is inspired by biological evolution, where sexual reproduction often spontaneously adds or removes genes; hence individual genes or small linked groups of genes must be self-sufficient. This equates to a 'rugged individualism' of units. Likewise, dropout forces neurons to be robust to the loss of co-workers.

This is also great for parallelization: each unit or sub-network can be trained independently, on its own core, with little need for communication! Later, the units can be combined via genetic algorithms and then re-trained.
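The dropout mechanic fits in a few lines. A sketch in numpy (names are mine) -- note this is the "inverted dropout" variant, which rescales at training time; the talk's original formulation instead halved the outgoing weights at test time, which is equivalent in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """Silence each hidden unit with probability p_drop during training,
    scaling survivors by 1/(1 - p_drop) so the expected activation matches
    the test-time network, where all units are active (the model-averaging step)."""
    if not train:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)
```

At p_drop=0.5 each example trains a random half of the units, so a layer of n units implicitly averages over up to 2^n thinned sub-networks.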
Hinton then observes that sending a real value p (the output of the logistic function) with probability 0.5 is the same, in expectation, as sending 0.5 with probability p. Hence it makes sense to try pure binary neurons, like biological neurons in the brain. Indeed, if you replace backpropagation with single-bit propagation, the resulting neural network trains more slowly and needs to be bigger, but it generalizes better.

Neurons (allegedly) do something very similar to this via Poisson spiking. Hinton claims this is the right thing to do (rather than sending real numbers via precise spike timing) if you want to robustly fit models to data: sending stochastic spikes is a very good way to average over the large number of models fit to incoming data. Yes, but this really explains little in neuroscience...

Paper referred to in the intro: Livnat, Papadimitriou and Feldman, PMID-19073912, and later by the same authors PMID-20080594, A mixability theory for the role of sex in evolution. -- "We define a measure that represents the ability of alleles to perform well across different combinations and, using numerical iterations within a classical population-genetic framework, show that selection in the presence of sex favors this ability in a highly robust manner"

Plus David MacKay's concise illustration of why you need sex, pg 269, __Information theory, inference, and learning algorithms__: with rather simple assumptions, asexual reproduction yields 1 bit per generation, whereas sexual reproduction yields $\sqrt{G}$ bits, where G is the genome size.

{1269} hide / / print ref: -0 tags: hinton convolutional deep networks image recognition 2012 date: 01-11-2014 20:14 gmt revision:0 [head]

ImageNet Classification with Deep Convolutional Networks
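Hinton's equivalence-in-expectation (send the real value p half the time, vs. send a fixed 0.5 with probability p) is easy to check empirically. A minimal numpy sketch (function names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def send_real(p, n):
    """Transmit the real-valued activation p with probability 0.5, else 0."""
    return np.where(rng.random(n) < 0.5, p, 0.0)

def send_binary(p, n):
    """Transmit a fixed 0.5 with probability p, else 0 -- a stochastic binary unit."""
    return np.where(rng.random(n) < p, 0.5, 0.0)
```

Both channels have expectation p/2, so a downstream unit averaging over many transmissions sees the same signal; the two differ in variance, which is where the regularizing noise comes from.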