[0] Schmidt EM, McIntosh JS, Durelli L, Bak MJ, Fine control of operantly conditioned firing patterns of cortical neurons. Exp Neurol 61:2, 349-69 (1978 Sep 1)
[1] Serruya MD, Hatsopoulos NG, Paninski L, Fellows MR, Donoghue JP, Instant neural control of a movement signal. Nature 416:6877, 141-2 (2002 Mar 14)
[2] Fetz EE, Operant conditioning of cortical unit activity. Science 163:870, 955-8 (1969 Feb 28)
[3] Fetz EE, Finocchio DV, Operant conditioning of specific patterns of neural and muscular activity. Science 174:7, 431-5 (1971 Oct 22)
[4] Fetz EE, Finocchio DV, Operant conditioning of isolated activity in specific muscles and precentral cells. Brain Res 40:1, 19-23 (1972 May 12)
[5] Fetz EE, Baker MA, Operantly conditioned patterns on precentral unit activity and correlated responses in adjacent cells and contralateral muscles. J Neurophysiol 36:2, 179-204 (1973 Mar)

[0] Bar-Gad I, Morris G, Bergman H, Information processing, dimensionality reduction and reinforcement learning in the basal ganglia. Prog Neurobiol 71:6, 439-73 (2003 Dec)

[0] Won DS, Wolf PD, A simulation study of information transmission by multi-unit microelectrode recordings. Network 15:1, 29-44 (2004 Feb)

ref: -2019 tags: HSIC information bottleneck deep learning backprop gaussian kernel date: 10-06-2021 17:23 gmt revision:5 [4] [3] [2] [1] [0] [head]

The HSIC Bottleneck: Deep learning without Back-propagation

In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure.

The information bottleneck was proposed by Bialek (of 'Spikes') et al. in 1999, and aims to increase the mutual information between the layer representation and the labels while minimizing the mutual information between the representation and the input:

$\min_{P_{T_i | X}} I(X; T_i) - \beta I(T_i; Y)$

Where $T_i$ is the hidden representation at layer i (the layer output), $X$ is the layer input, and $Y$ are the labels. By replacing $I()$ with the HSIC, and some derivation (?), they show that

$HSIC(D) = (m-1)^{-2} tr(K_X H K_Y H)$

Where $D = \{(x_1,y_1), \ldots, (x_m, y_m)\}$ are samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$ -- that is, it's the kernel function applied to all pairs of (vector-valued) input variables. $H$ is the centering matrix. The kernel is simply a Gaussian kernel, $k(x,y) = \exp(-\frac{1}{2} ||x-y||^2/\sigma^2)$. So, if all the x and y are on average independent, then the inner-product will be mean zero, the kernel will be mean one, and after centering will lead to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and positive: the trace of a product of two centered positive semi-definite kernel matrices cannot be negative). In practice they use three different widths for their kernel, and they also center the kernel matrices.
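To make the trace formula concrete, here is a minimal NumPy sketch of the empirical HSIC with a Gaussian kernel. It is not the authors' code: the function names (`gaussian_kernel`, `hsic`), the single kernel width `sigma=1.0` (the paper reportedly combines three widths), and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(a, sigma=1.0):
    """Pairwise Gaussian kernel matrix: K[i, j] = exp(-0.5 * ||a_i - a_j||^2 / sigma^2)."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]                      # treat a 1-D array as m scalar samples
    sq = np.sum(a ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (a @ a.T)
    d2 = np.maximum(d2, 0.0)                # clip tiny negative values from round-off
    return np.exp(-0.5 * d2 / sigma ** 2)

def hsic(x, y, sigma=1.0):
    """Empirical HSIC: (m - 1)^-2 * tr(K_X H K_Y H), with H the centering matrix."""
    x = np.asarray(x, dtype=float)
    m = x.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m     # centering matrix
    Kx = gaussian_kernel(x, sigma)
    Ky = gaussian_kernel(y, sigma)
    return float(np.trace(Kx @ H @ Ky @ H) / (m - 1) ** 2)

# Toy data (assumed for illustration): y_dep depends on x, y_ind does not.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_dep = x + 0.1 * rng.normal(size=(200, 1))
y_ind = rng.normal(size=(200, 1))

hsic_dep = hsic(x, y_dep)
hsic_ind = hsic(x, y_ind)
print(f"HSIC dependent:   {hsic_dep:.4f}")
print(f"HSIC independent: {hsic_ind:.4f}")
```

The dependent pair should give a clearly larger value than the independent pair, and both stay non-negative up to round-off. The cost is O(m^2) in memory for the kernel matrices, which is presumably why this is evaluated per mini-batch in practice.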

But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this...

For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: "Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks", albeit in a much less intelligible way.

Robust Learning with the Hilbert-Schmidt Independence Criterion

Is another, later, paper using the HSIC. Their interpretation: "This loss-function encourages learning models where the distribution of the residuals between the label and the model prediction is statistically independent of the distribution of the instances themselves." Hence, given above nomenclature, $E_X( P_{T_i | X} I(X ; T_i) ) = 0$ (I'm not totally sure about the weighting, but might be required given the definition of the HSIC.)

As I understand it, the HSIC loss is a kernelized loss between the input, output, and labels that encourages a degree of invariance to input ('covariate shift'). This is useful, but I'm unconvinced that making the layer output independent of the input is absolutely essential (??)

ref: Schmidt-1978.09 tags: Schmidt BMI original operant conditioning cortex HOT pyramidal information antidromic date: 03-12-2019 23:35 gmt revision:11 [10] [9] [8] [7] [6] [5] [head]

PMID-101388[0] Fine control of operantly conditioned firing patterns of cortical neurons.

  • Hand-arm area of M1, 11 or 12 chronic recording electrodes, 3 monkeys.
    • But, they only used one unit at a time in the conditioning task.
  • Observed conditioning in 77% of single units and 65% of combined units (multiunits?).
  • Trained to move a handle to a position indicated by 8 annular cursor lights.
    • Cursor was updated at 50 Hz -- this was just a series of lights! Talk about simple feedback...
    • Investigated different smoothing: too fast, FR does not stay in target; too slow, cursor acquires target too slowly.
      • My gamma function is very similar to their lowpass filter used for smoothing the firing rates.
    • 4 or 8 target random tracking task
    • Time-out of 8 seconds
    • Run of 40 trials
      • The conditioning reached a significant level of performance after 2.2 runs of 40 trials (in well-trained monkeys); typically, they did 18 runs/day (720 trials)
  • Recordings:
    • Scalar mapping of unit firing rate to cursor position.
    • Filtered 600 Hz - 6 kHz
    • Each accepted spike triggered a generator that produced a pulse of constant amplitude and width -> this was fed into a lowpass filter (1.5 to 2.5 & 3.5 Hz cutoff), and a gain stage, then an ADC, then (presumably) the PDP.
      • can determine if these units were in the pyramidal tract by measuring antidromic delay.
    • recorded one neuron for 108 days!!
      • Neuronal activity is still being recorded from one monkey 24 months after chronic implantation of the microelectrodes.
    • Average period in which conditioning was attempted was 3.12 days.
  • Successful conditioning was always associated with specific repeatable limb movements
    • "However, what appears to be conditioned in these experiments is a movement, and the neuron under study is correlated with that movement." YES.
    • The monkeys clearly learned to make (increasingly refined) movement to modulate the firing activity of the recorded units.
    • The monkey learned to turn off certain units with specific limb positions; the monkey used exaggerated movements for these purposes.
      • e.g. finger and shoulder movements, isometric contraction in one case.
  • Trained some monkeys for > 15 months; animals got better at the task over time.
  • PDP-12 computer.
  • Information measure: 0 bits for missed targets, 2 for a 4 target task, 3 for 8 target task; information rate = total number of bits / time to acquire targets.
    • 3.85 bits/sec peak with 4 targets, 500ms hold time
    • With this, monkeys were able to exert fine control of firing rate.
    • Damn! compare to Paninski! [1]
  • 4.29 bits/sec when the same task was performed with a manipulandum & wrist movement
  • they were able to condition 77% of individual neurons and 65% of combined units.
  • Implanted a pyramidal tract electrode in one monkey; both cells recorded at that time were pyramidal tract neurons, antidromic latencies of 1.2 - 1.3ms.
    • Failures had no relation to overt movements of the monkey.
  • Fetz and Baker [2,3,4,5] found that 65% of precentral neurons could be conditioned for increased or decreased firing rates.
    • and it only took 6.5 minutes, on average, for the units to change firing rates!
  • Summarized in [1].
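The information measure in the notes above (0 bits for a miss, 2 bits per hit with 4 targets, 3 bits per hit with 8 targets, divided by acquisition time) can be sketched as follows. The function name and the session numbers are hypothetical, chosen only to show the arithmetic; they are not data from the paper.

```python
import math

def information_rate(n_targets, n_hits, total_time_s):
    """Bits/sec under the scheme in the notes: each acquired target is worth
    log2(n_targets) bits, each missed target is worth 0 bits."""
    bits_per_hit = math.log2(n_targets)   # 2 bits for 4 targets, 3 bits for 8
    return n_hits * bits_per_hit / total_time_s

# Hypothetical session: 10 of 12 targets acquired in 8 seconds, 4-target task.
rate = information_rate(n_targets=4, n_hits=10, total_time_s=8.0)
print(f"{rate:.2f} bits/sec")
```

Note that this measure credits every hit with the full log2(N) bits, so it does not discount for chance-level acquisitions.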


ref: BarGad-2003.12 tags: information dimensionality reduction reinforcement learning basal_ganglia RDDR SNR globus pallidus date: 01-16-2012 19:18 gmt revision:3 [2] [1] [0] [head]

PMID-15013228[0] Information processing, dimensionality reduction, and reinforcement learning in the basal ganglia (2003)

  • long paper! looks like they used latex.
  • they focus on a 'new model' for the basal ganglia: reinforcement driven dimensionality reduction (RDDR)
  • in order to make sense of the system - according to them - any model must ignore huge amounts of information about the studied areas.
  • ventral striatum = nucleus accumbens!
  • striatum is broken into two, rough, parts: ventral and dorsal
    • dorsal striatum: the caudate and putamen are a part of the
    • ventral striatum: the nucleus accumbens, medial and ventral portions of the caudate and putamen, and striatal cells of the olfactory tubercle (!) and anterior perforated substance.
  • ~90% of neurons in the striatum are medium spiny neurons
    • dendrites fill 0.5mm^3
    • cells have up and down states.
      • the states are controlled by intrinsic connections
      • project to GPe GPi & SNr (primarily), using GABA.
  • 1-2% of neurons in the striatum are tonically active neurons (TANs)
    • use acetylcholine (among others)
    • fewer spines
    • more sensitive to input
    • TANs encode information relevant to reinforcement or incentive behavior


ref: work-0 tags: gaussian random variables mutual information SNR date: 01-16-2012 03:54 gmt revision:26 [25] [24] [23] [22] [21] [20] [head]

I've recently tried to determine the bit-rate conveyed by one Gaussian random process about another in terms of the signal-to-noise ratio between the two. Assume $x$ is the known signal to be predicted, and $y$ is the prediction.

Let's define $SNR(y) = \frac{Var(x)}{Var(err)}$ where $err = x-y$. Note this is a ratio of powers; for the conventional SNR, $SNR_{dB} = 10 \log_{10} \frac{Var(x)}{Var(err)}$. $Var(err)$ is also known as the mean-squared-error (mse).

Now, $Var(err) = \frac{1}{N} \sum (x - y - \bar{err})^2 = Var(x) + Var(y) - 2 Cov(x,y)$; assume x and y have unit variance (or scale them so that they do), then

$\frac{2 - SNR(y)^{-1}}{2} = Cov(x,y)$

We need the covariance because the mutual information between two jointly Gaussian zero-mean variables can be defined in terms of their covariance matrix: (see http://www.springerlink.com/content/v026617150753x6q/ ). Here Q is the covariance matrix,

$Q = \begin{bmatrix} Var(x) & Cov(x,y) \\ Cov(x,y) & Var(y) \end{bmatrix}$

$MI = \frac{1}{2} \log_2 \frac{Var(x) Var(y)}{\det(Q)}$

$\det(Q) = 1 - Cov(x,y)^2$

Then $MI = -\frac{1}{2} \log_2 \left[ 1 - Cov(x,y)^2 \right]$

or $MI = -\frac{1}{2} \log_2 \left[ SNR(y)^{-1} - \frac{1}{4} SNR(y)^{-2} \right]$

This agrees with intuition. If we have an SNR of 10 dB, or 10 (power ratio), then we would expect to be able to break a random variable into about 10 different categories or bins (recall stdev is the sqrt of the variance), with the probability of the variable being in the estimated bin to be 1/2. (This, at least in my mind, is where the 1/2 constant comes from - if there is Gaussian noise, you won't be able to determine exactly which bin the random variable is in, hence log_2 is an overestimator.)

Here is a table with the respective values, including the amplitude (not power) ratio representations of SNR.

SNR | Amp. ratio | MI (bits)

Note that at 90 dB, you get about 15 bits of resolution. This makes sense, as 16-bit DACs and ADCs have (typically) 96 dB SNR. Good.
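The three columns can be regenerated directly from the closed-form expression above. The sketch below assumes unit-variance x and y, as in the derivation; the choice of rows (10 to 90 dB in 10 dB steps) and the function name `mi_bits` are mine, for illustration only.

```python
import math

def mi_bits(snr_power):
    """Mutual information in bits per sample, from the formula above:
    MI = -1/2 * log2(SNR^-1 - 1/4 * SNR^-2), with SNR a power ratio."""
    inv = 1.0 / snr_power
    return -0.5 * math.log2(inv - 0.25 * inv ** 2)

rows = []
for snr_db in range(10, 100, 10):         # assumed row spacing, not from the original table
    snr_power = 10 ** (snr_db / 10)       # dB -> power ratio
    amp_ratio = math.sqrt(snr_power)      # power ratio -> amplitude ratio
    rows.append((snr_db, amp_ratio, mi_bits(snr_power)))

print("SNR (dB) | Amp. ratio | MI (bits)")
for snr_db, amp_ratio, mi in rows:
    print(f"{snr_db:8d} | {amp_ratio:10.1f} | {mi:9.2f}")
```

For large SNR the second term is negligible and the expression approaches 1/2 log2(SNR), i.e. roughly one bit per 6 dB.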

Now, to get the bitrate, you take the SNR, calculate the mutual information, and multiply it by the bandwidth (not the sampling rate in a discrete time system) of the signals. In our particular application, I think the bandwidth is between 1 and 2 Hz, hence we're getting 1.6-3.2 bits/second/axis, hence 3.2-6.4 bits/second for our normal 2D tasks. If you read this blog regularly, you'll notice that others have achieved 4 bits/sec with one neuron and 6.5 bits/sec with dozens {271}.
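As a quick check of that arithmetic, here is a sketch following the convention used in this note (bits per sample times bandwidth, times number of axes). The value of 1.6 bits per sample is inferred from the quoted 1.6 bits/second at 1 Hz; the function name is hypothetical.

```python
def bitrate(mi_bits_per_sample, bandwidth_hz, n_axes=1):
    """Bits/sec using this note's convention: MI per sample * bandwidth * axes."""
    return mi_bits_per_sample * bandwidth_hz * n_axes

mi = 1.6  # bits per sample, inferred from the figures quoted above (assumption)
for bw in (1.0, 2.0):
    print(f"{bw:.0f} Hz: {bitrate(mi, bw):.1f} bits/sec/axis, "
          f"{bitrate(mi, bw, n_axes=2):.1f} bits/sec for a 2D task")
```

One caveat: the Shannon-Hartley capacity of a band-limited Gaussian channel, C = B log2(1 + S/N), corresponds to 2B independent samples per second, so multiplying the per-sample MI by B alone may be conservative by about a factor of two.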

ref: bookmark-0 tags: machine_learning research_blog parallel_computing bayes active_learning information_theory reinforcement_learning date: 12-31-2011 19:30 gmt revision:3 [2] [1] [0] [head]

hunch.net interesting posts:

  • debugging your brain - how to discover what you don't understand. a very intelligent viewpoint, worth rereading + the comments. look at the data, stupid
    • quote: how to represent the problem is perhaps even more important in research since human brains are not as adept as computers at shifting and using representations. Significant initial thought on how to represent a research problem is helpful. And when it’s not going well, changing representations can make a problem radically simpler.
  • automated labeling - great way to use a human 'oracle' to bootstrap us into good performance, esp. if the predictor can output a certainty value and hence ask the oracle all the 'tricky questions'.
  • The design of an optimal research environment
    • Quote: Machine learning is a victim of it’s common success. It’s hard to develop a learning algorithm which is substantially better than others. This means that anyone wanting to implement spam filtering can do so. Patents are useless here—you can’t patent an entire field (and even if you could it wouldn’t work).
  • More recently: http://hunch.net/?p=2016
    • Problem is that online courses only imperfectly emulate the social environment of a college, which IMHO is useful for cultivating diligence.
  • The unrealized potential of the research lab Quote: Muthu Muthukrishnan says “it’s the incentives”. In particular, people who invent something within a research lab have little personal incentive in seeing it’s potential realized so they fail to pursue it as vigorously as they might in a startup setting.
    • The motivation (money!) is just not there.

ref: Won-2004.02 tags: Debbie Won Wolf spike sorting mutual information tuning BMI date: 12-07-2011 02:58 gmt revision:3 [2] [1] [0] [head]

PMID-15022843[0] A simulation study of information transmission by multi-unit microelectrode recordings. Key idea:

  • when the units on a single channel are similarly tuned, you don't lose much information by grouping all spikes as coming from one source. The opposite is true when you have very differently tuned neurons on the same channel - the information becomes more ambiguous.


ref: notes-0 tags: neuroscience ion channels information coding John Harris date: 01-07-2008 16:46 gmt revision:4 [3] [2] [1] [0] [head]

  • crazy idea: that neurons have a number of ion channel lines which can be selectively activated. That is, information is transmitted by longitudinal transmission channels which are selectively activated based on the message that is transmitted
  • has any evidence for such a fine structure been found?? I think not, due to binding studies, but who knows..
  • dude uses historical references (Neumann) to back up his ideas. I find these sorts of justifications interesting, but not logically substantive. Do not talk about the opinions of old philosophers (exclusively, at least), talk about their data.
  • interesting story about holography & the holograph of Dennis Gabor.
    • he does make interesting analogies to neuroscience & the importance of preserving spatial phase.
  • fourier images -- neato.
conclusion: interesting, but a bit kooky.

ref: notes-0 tags: SNR MSE error multidimensional mutual information date: 03-08-2007 22:33 gmt revision:2 [1] [0] [head]

http://ieeexplore.ieee.org/iel5/516/3389/00116771.pdf or http://hardm.ath.cx:88/pdf/MultidimensionalSNR.pdf

  • the signal-to-noise ratio between two vectors is the ratio of the determinants of the correlation matrices. Just see equation 14.

ref: bookmark-0 tags: book information_theory machine_learning bayes probability neural_networks mackay date: 0-0-2007 0:0 revision:0 [head]

http://www.inference.phy.cam.ac.uk/mackay/itila/book.html -- free! (but i liked the book, so I bought it :)

ref: bookmark-0 tags: information entropy bit rate matlab code date: 0-0-2006 0:0 revision:0 [head]


  • concise, well documented, useful.
  • number of bins = length of vector ^ (1/3).
  • information = sum(log (bincounts / prior) * bincounts) -- this is just the divergence, same as I do it.
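The bookmarked Matlab code is not reproduced here, so the following is only a NumPy sketch of the recipe as summarized in the bullets: bin the data with about N^(1/3) bins and take the divergence of the bin frequencies from a prior. Normalizing the counts to probabilities, defaulting to a uniform prior, and the function name are my assumptions.

```python
import numpy as np

def binned_divergence_bits(x, prior=None):
    """KL divergence (in bits) of the binned distribution of x from a prior over bins.
    Number of bins = len(x) ** (1/3), as in the notes; prior defaults to uniform."""
    x = np.asarray(x, dtype=float)
    n_bins = max(1, int(round(len(x) ** (1.0 / 3.0))))
    counts, _ = np.histogram(x, bins=n_bins)
    p = counts / counts.sum()                    # normalize counts to probabilities
    if prior is None:
        prior = np.full(n_bins, 1.0 / n_bins)    # assumed default: uniform prior
    prior = np.asarray(prior, dtype=float)
    mask = p > 0                                 # empty bins contribute 0 (0 * log 0 := 0)
    return float(np.sum(p[mask] * np.log2(p[mask] / prior[mask])))

rng = np.random.default_rng(0)
flat = np.linspace(0.0, 1.0, 1000)               # nearly uniform over its range
peaked = rng.normal(size=1000)                   # concentrated around zero
d_flat = binned_divergence_bits(flat)
d_peaked = binned_divergence_bits(peaked)
print(f"flat:   {d_flat:.3f} bits")
print(f"peaked: {d_peaked:.3f} bits")
```

Evenly spread data should come out near zero bits against the uniform prior, while the peaked sample should come out clearly positive.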

ref: bookmark-0 tags: machine_learning classification entropy information date: 0-0-2006 0:0 revision:0 [head]

http://iridia.ulb.ac.be/~lazy/ -- Lazy Learning.