m8ta
{1544}
The HSIC Bottleneck: Deep learning without Back-propagation

In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure.

The information bottleneck was proposed by Tishby, Pereira, and Bialek (Spikes..) in 1999, and aims to increase the mutual information between the layer representation and the labels while minimizing the mutual information between the representation and the input:

$$ \min_{P(T_i | X)} \; I(X; T_i) - \beta \, I(T_i; Y) $$

Where $T_i$ is the hidden representation at layer i (layer output), $X$ is the layer input, and $Y$ are the labels. By replacing $I(\cdot)$ with the HSIC, and some derivation (?), they show that

$$ \mathrm{HSIC}(X, Y) = (m-1)^{-2} \, \mathrm{tr}(K_X H K_Y H) $$
Where $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ are the $m$ samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$ -- that is, it's the kernel function applied to all pairs of (vectoral) input variables. $H = I_m - \frac{1}{m} \mathbf{1}\mathbf{1}^T$ is the centering matrix. The kernel is simply a Gaussian kernel, $k(x, y) = \exp(-\frac{1}{2} \|x - y\|^2 / \sigma^2)$.

So, if all the x and y are on average independent, then the inner-product will be mean zero, the kernel will be mean one, and after centering will lead to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and positive, since the trace of a product of two centered, positive semidefinite kernel matrices is non-negative). In practice they use three different widths for their kernel, and they also center the kernel matrices.

But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this... For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks, albeit in a much less intelligible way.

Robust Learning with the Hilbert-Schmidt Independence Criterion is another, later, paper using the HSIC. Their interpretation: "This loss-function encourages learning models where the distribution of the residuals between the label and the model prediction is statistically independent of the distribution of the instances themselves." Hence, given the above nomenclature, (I'm not totally sure about the weighting, but it might be required given the definition of the HSIC.) As I understand it, the HSIC loss is a kernelized loss between the input, output, and labels that encourages a degree of invariance to input ('covariate shift').

This is useful, but I'm unconvinced that making the layer output independent of the input is absolutely essential (??)
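To make the trace estimator described above concrete, here is a minimal NumPy sketch of the empirical HSIC with a Gaussian kernel. It is an illustration under stated assumptions, not the authors' implementation: the function names `gaussian_kernel` and `hsic` are hypothetical, a single kernel width `sigma` is used (the notes above say the paper uses three), and the toy data are synthetic.

```python
import numpy as np

def gaussian_kernel(a, sigma=1.0):
    """Pairwise Gaussian kernel: K[i, j] = exp(-0.5 * ||a_i - a_j||^2 / sigma^2)."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]  # treat a 1-D array as m scalar samples
    sq = np.sum(a ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (a @ a.T)
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-0.5 * d2 / sigma ** 2)

def hsic(x, y, sigma=1.0):
    """Empirical HSIC: (m - 1)^-2 * tr(K_X H K_Y H), with H the centering matrix."""
    m = len(x)
    H = np.eye(m) - np.ones((m, m)) / m
    Kx = gaussian_kernel(x, sigma)
    Ky = gaussian_kernel(y, sigma)
    return np.trace(Kx @ H @ Ky @ H) / (m - 1) ** 2

# Toy check (synthetic data, assumed for illustration only):
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 1))
y_indep = rng.standard_normal((200, 1))          # independent of x
y_dep = x + 0.1 * rng.standard_normal((200, 1))  # strongly dependent on x

hsic_indep = hsic(x, y_indep)
hsic_dep = hsic(x, y_dep)
print("HSIC, independent pair:", hsic_indep)
print("HSIC, dependent pair:  ", hsic_dep)
```

The dependent pair should score noticeably higher than the independent pair, and both values are non-negative up to round-off, which is the property a per-layer objective built on the HSIC relies on.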
{305}
PMID-101388[0] Fine control of operantly conditioned firing patterns of cortical neurons.
____References____
{255}
ref: BarGad-2003.12
tags: information dimensionality reduction reinforcement learning basal_ganglia RDDR SNR globus pallidus
date: 01-16-2012 19:18 gmt
revision:3
[2] [1] [0] [head]
PMID-15013228[] Information processing, dimensionality reduction, and reinforcement learning in the basal ganglia (2003)
____References____
{806}
I've recently tried to determine the bit-rate conveyed by one Gaussian random process about another in terms of the signal-to-noise ratio between the two. Assume $x$ is the known signal to be predicted, and $y$ is the prediction. Let's define $SNR(y) = \frac{Var(x)}{Var(err)}$ where $err = x - y$. Note this is a ratio of powers; for the conventional SNR, $SNR_{dB} = 10 \log_{10} \frac{Var(x)}{Var(err)}$. $Var(err)$ is also known as the mean-squared-error (mse), provided the error has zero mean. Now, $Var(err) = Var(x - y) = Var(x) + Var(y) - 2 Cov(x, y)$; assume x and y have unit variance (or scale them so that they do), then

$$ Cov(x, y) = \frac{2 - SNR(y)^{-1}}{2} $$

We need the covariance because the mutual information between two jointly Gaussian zero-mean variables can be defined in terms of their covariance matrix: $I(x, y) = \frac{1}{2} \log_2 \frac{Var(x) Var(y)}{\det(Q)}$ (see http://www.springerlink.com/content/v026617150753x6q/ ). Here Q is the covariance matrix,

$$ Q = \begin{bmatrix} Var(x) & Cov(x, y) \\ Cov(x, y) & Var(y) \end{bmatrix}, \qquad \det(Q) = 1 - Cov(x, y)^2 $$

Then $I(x, y) = -\frac{1}{2} \log_2 \left[ 1 - Cov(x, y)^2 \right]$ or

$$ I(x, y) = -\frac{1}{2} \log_2 \left[ SNR(y)^{-1} - \frac{1}{4} SNR(y)^{-2} \right] $$

This agrees with intuition. If we have an SNR of 10 dB, or 10 (power ratio), then we would expect to be able to break a random variable into about 10 different categories or bins (recall stdev is the sqrt of the variance), with the probability of the variable being in the estimated bin to be 1/2. (This, at least in my mind, is where the 1/2 constant comes from - if there is Gaussian noise, you won't be able to determine exactly which bin the random variable is in, hence log_2 is an overestimator.) For example, an SNR of 10 gives $I(x, y) \approx 1.68$ bits. Here is a table with the respective values, including the amplitude (not power) ratio representations of SNR.
Now, to get the bitrate, you take the SNR, calculate the mutual information, and multiply it by the bandwidth (not the sampling rate in a discrete time system) of the signals. In our particular application, I think the bandwidth is between 1 and 2 Hz, hence we're getting 1.6-3.2 bits/second/axis, hence 3.2-6.4 bits/second for our normal 2D tasks. If you read this blog regularly, you'll notice that others have achieved 4 bits/sec with one neuron and 6.5 bits/sec with dozens {271}.
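The SNR-to-bitrate recipe above can be sketched in a few lines of Python. This is only an illustration of the formulas in this note, assuming unit-variance, jointly Gaussian x and y; the function names (`mutual_information_bits`, `db_to_power`, `bitrate`) and the particular SNR values tabulated are my own choices, not part of the original analysis.

```python
import math

def db_to_power(snr_db):
    """Convert an SNR in dB to a power ratio."""
    return 10.0 ** (snr_db / 10.0)

def mutual_information_bits(snr_power):
    """I(x, y) in bits for unit-variance jointly Gaussian x, y.

    Uses Cov(x, y) = 1 - 1 / (2 * SNR) and I = -0.5 * log2(1 - Cov^2).
    With both variances fixed at 1, Var(err) cannot exceed 4, so the
    power-ratio SNR must be greater than 0.25.
    """
    if snr_power <= 0.25:
        raise ValueError("SNR must exceed 0.25 under the unit-variance assumption")
    cov = 1.0 - 1.0 / (2.0 * snr_power)
    return -0.5 * math.log2(1.0 - cov ** 2)

def bitrate(snr_power, bandwidth_hz):
    """Bits/second as described in the text: mutual information times bandwidth."""
    return bandwidth_hz * mutual_information_bits(snr_power)

# Tabulate a few illustrative SNR values (chosen here, not from the original table).
for snr_db in (0, 3, 6, 10, 20):
    power = db_to_power(snr_db)
    amplitude = math.sqrt(power)  # amplitude ratio is the sqrt of the power ratio
    bits = mutual_information_bits(power)
    print(f"{snr_db:>3} dB  power {power:7.2f}  amplitude {amplitude:6.2f}  I = {bits:5.2f} bits")

# Per-axis bitrate at 10 dB for the 1-2 Hz bandwidth range mentioned above.
for bw in (1.0, 2.0):
    print(f"bandwidth {bw} Hz -> {bitrate(db_to_power(10), bw):.2f} bits/s/axis")
```

At 10 dB this gives roughly 1.7 bits per sample, in line with the 1.6-3.2 bits/second/axis range quoted for a 1-2 Hz bandwidth.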
{5}
ref: bookmark-0
tags: machine_learning research_blog parallel_computing bayes active_learning information_theory reinforcement_learning
date: 12-31-2011 19:30 gmt
revision:3
[2] [1] [0] [head]
hunch.net interesting posts:
{252}
PMID-15022843[0] A simulation study of information transmission by multi-unit microelectrode recordings key idea:
____References____
{530}
{229}
ref: notes-0
tags: SNR MSE error multidimensional mutual information
date: 03-08-2007 22:33 gmt
revision:2
[1] [0] [head]
http://ieeexplore.ieee.org/iel5/516/3389/00116771.pdf or http://hardm.ath.cx:88/pdf/MultidimensionalSNR.pdf
{7}
ref: bookmark-0
tags: book information_theory machine_learning bayes probability neural_networks mackay
date: 0-0-2007 0:0
revision:0
[head]
http://www.inference.phy.cam.ac.uk/mackay/itila/book.html -- free! (but I liked the book, so I bought it :)
{57}
http://www.cs.rug.nl/~rudy/matlab/
{66}
ref: bookmark-0
tags: machine_learning classification entropy information
date: 0-0-2006 0:0
revision:0
[head]
http://iridia.ulb.ac.be/~lazy/ -- Lazy Learning.