m8ta
{1554}
ref: -2021 tags: FIBSEM electron microscopy presynaptic plasticity activity Funke date: 10-12-2021 17:03 gmt revision:0 [head]

Ultrastructural readout of in vivo synaptic activity for functional connectomics

  • Anna Simon, Arnd Roth, Arlo Sheridan, Mehmet Fişek, Vincenzo Marra, Claudia Racca, Jan Funke, Kevin Staras, Michael Häusser
  • Did FIB-SEM on FM1-43 dye labeled synapses, then segmented the cells using machine learning, as Jan has pioneered.
    • FM1-43FX is membrane impermeable, and labels only synaptic vesicles that have been recycled after dye loading. (Invented in 1992!)
    • FM1-43FX is also able to photoconvert diaminobenzidine (DAB) into an amorphous, highly conjugated polymer with high affinity for osmium tetroxide
  • This allows for a snapshot of ultrastructural presynaptic plasticity / activity.
  • N=84 boutons, but n=7 pairs / triples of boutons from the same axon.
    • These boutons have the same presynaptic spiking activity, and hence are expected to have the same release probability, and hence the same photoconversion (PC) labeling.
      • But they don't! The ratio of PC+ vesicle numbers between boutons on the same neuron is low, mean < 0.4, which suggests some boutons have high neurotransmitter release and recycling, others have low...
  • Quote in the abstract: We also demonstrate that neighboring boutons of the same axon, which share the same spiking activity, can differ greatly in their presynaptic release probability.
    • Well, sorta, the data here is a bit weak. It might all be lognormal fluctuations, as has been well demonstrated.
    • When I read it I was excited to think of the influence of presynaptic inhibition / modulation, which has not been measured here, but is likely to be important.

{1529}
ref: -2020 tags: dreamcoder ellis program induction ai tenenbaum date: 10-10-2021 17:32 gmt revision:2 [1] [0] [head]

DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning

  • Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales, Luke Hewitt, Armando Solar-Lezama, Joshua B. Tenenbaum

This paper describes a system for adaptively finding programs which succinctly and accurately produce desired output. These desired outputs are provided by the user / test system, and come from a number of domains:

  • list (as in lisp) processing,
  • text editing,
  • regular expressions,
  • line graphics,
  • 2d lego block stacking,
  • symbolic regression (ish),
  • functional programming,
  • and physical laws.
Some of these domains are naturally toy-like, e.g. the text processing, but others are deeply impressive: the system was able to "re-derive" basic physical laws of vector calculus in the process of looking for S-expression forms of cheat-sheet physics equations. These advancements result from a long lineage of work, perhaps starting from the Helmholtz machine PMID-7584891 introduced by Peter Dayan, Geoff Hinton and others, where one model is trained to generate patterns given context (e.g.) while a second recognition module is trained to invert this model: derive context from the patterns. The two work simultaneously to allow model-exploration in high dimensions.

Also in the lineage is the EC2 algorithm, which most of the same authors above published in 2018. EC2 centers around the idea of "explore - compress": explore solutions to your program induction problem during the 'wake' phase, then compress the observed programs into a library by extracting / factoring out commonalities during the 'sleep' phase. This of course is one of the core algorithms of human learning: explore options, keep track of both what worked and what didn't, search for commonalities among the options & their effects, and use these inferred laws or heuristics to further guide search and goal-setting, thereby building a buffer against the curse of dimensionality. Making the inferred laws themselves functions in a programming library allows hierarchically factoring the search task, making exploration of unbounded spaces possible. This advantage is unique to the program synthesis approach.
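
To make the explore-compress loop concrete, here is a toy sketch in Python. This is not DreamCoder's actual algorithm -- no types, no version spaces, no neural recognition model -- and the primitives and tasks are invented for illustration: programs are just left-to-right compositions of unary integer functions, and 'compression' merely promotes the most common adjacent pair of primitives into the library.

from itertools import product
from collections import Counter

# Library of named unary primitives over integers.
library = {"inc": lambda x: x + 1, "double": lambda x: x * 2, "square": lambda x: x ** 2}

def run(program, x):
    """Apply a program (a tuple of primitive names, applied left to right) to x."""
    for name in program:
        x = library[name](x)
    return x

def wake(tasks, max_len=3):
    """Explore: enumerate compositions of library primitives until each task's examples are matched."""
    solutions = {}
    for task, examples in tasks.items():
        for length in range(1, max_len + 1):
            for prog in product(library, repeat=length):
                if all(run(prog, x) == y for x, y in examples):
                    solutions[task] = prog
                    break
            if task in solutions:
                break
    return solutions

def compress(solutions):
    """Sleep: factor the most common adjacent pair of primitives into a new library entry."""
    pairs = Counter(p for prog in solutions.values() for p in zip(prog, prog[1:]))
    if pairs:
        (a, b), _ = pairs.most_common(1)[0]
        library[f"({a}>{b})"] = lambda x, a=a, b=b: library[b](library[a](x))

tasks = {
    "f(x) = 2x + 2":  [(1, 4), (3, 8)],
    "f(x) = (2x)^2":  [(1, 4), (2, 16)],
}
solutions = wake(tasks)
print(solutions)       # e.g. {'f(x) = 2x + 2': ('inc', 'double'), 'f(x) = (2x)^2': ('double', 'square')}
compress(solutions)
print(list(library))   # the factored-out pair is now a primitive, shortening future searches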

The preceding is said in the paper's introduction, though perhaps with more clarity. DreamCoder is an improved, more-accessible version of EC2, though the underlying ideas are the same. It differs in that the method for constructing libraries has improved through the addition of a powerful version space for enumerating and evaluating refactors of the solutions generated during the wake phase. (I will admit that I don't much understand the version space system.) This version space allows DreamCoder to collapse the search space for re-factorings by many orders of magnitude, and seems to be a clear advancement. Furthermore, DreamCoder incorporates a second phase of sleep: "dreaming", hence the moniker. During dreaming the library is used to create 'dreams' consisting of combinations of the library primitives, which are then executed with training data as input. These dreams are then used to train up a neural network to predict which library and atomic objects to use in given contexts. Context in this case is where in the parse tree a given object has been inserted (its parent and which argument number it sits in); how the data-context is incorporated to make this decision is not clear to me (???).

This dream- and replay-trained neural network is either a GRU recurrent net with 64 hidden states, or a convolutional network feeding into an RNN. The final stage is a linear ReLU (???); it is again not clear how this feeds into the prediction of "which unit to use when". The authors clearly demonstrate that the network, or the probabilistic context-free grammar that it controls (?), is capable of straightforward optimizations, like breaking symmetries due to commutativity, avoiding adding zero, avoiding multiplying by one, etc. Beyond this, they do demonstrate via an ablation study that the presence of the neural network affords significant algorithmic leverage in all of the problem domains tested. The network also seems to learn a reasonable representation of the sub-type of task encountered -- but a thorough investigation of how it works, or how it might be made to work better, remains desired.

I've spent a little time looking around the code, which is a mix of Python high-level experimental control code, and lower-level OCaml code responsible for running (emulating) the lisp-like DSL, inferring types in its polymorphic system / reconciling types in evaluated program instances, maintaining the library, and recompressing it using the aforementioned version spaces. The code, like many things experimental, is clearly a work in progress, with some old or unused code scattered about, glue to run the many experiments & record / analyze the data, and personal notes from the first author for making his job talks (! :). The description in the supplemental materials, which is satisfyingly thorough (if again impenetrable wrt version spaces), is readily understandable, suggesting that one (presumably the first) author has a clear understanding of the system. It doesn't appear that much is being hidden or glossed over, which is not the case for all scientific papers.


With the caveat that I don't claim to understand the system completely, there are some clear areas where the existing system could be augmented further. The 'recognition' or perceptual module, which guides the actual synthesis of candidate programs, could realistically use as much information as is available: full lexical and semantic scope, full input-output specifications, type information, and possibly runtime binding of variables when filling holes. This is motivated by the way that humans solve problems, at least as observed by introspection:
  • Examine problem, specification; extract patterns (via perceptual modules)
  • Compare patterns with an existing library (memory) of compositionally-factored 'useful solutions' (this is identical to the library in DreamCoder).
  • Do something like beam search or quasi-stochastic search on the selected useful solutions. This is the same as DreamCoder; however, human engineers make decisions progressively, at runtime so-to-speak: you fill not one hole per cycle, but many holes. The addition of recursion to DreamCoder, provided a wider breadth of input information, could support this functionality.
  • Run the program to observe input-output .. but also observe the inner workings of the program, eg. dataflow patterns. These dataflow patterns are useful to human engineers when both debugging and when learning-by-inspection what library elements do. DreamCoder does not really have this facility.
  • Compare the current program results to the desired program output. Make a stochastic decision whether to try to fix it, or to try another beam in the search. Since this would be on a computer, this could be done in parallel (as DreamCoder is); the ability to 'fix' or change a DUT is simply absent in DreamCoder. As a 'deeply philosophical' aside, this loop itself might be the effect of running a language-of-thought program, as was suggested by pioneers in AI (ref). The loop itself is subject to modification and replacement based on goal-seeking success in the domain of interest, in a deeply-satisfying and deeply recursive manner ...
At each stage in the pipeline, the perceptual modules would have access to relevant variables in the current problem-solving context. This is modeled on Jacques Pitrat's work. Humans of course are even more flexible than that -- context includes roughly the whole brain, and if anything we're mushy on which level of the hierarchy we are working.

Critical to making this work is to have, as I wrote in my notes many years ago, a 'self compressing and factorizing memory'. The version space magic + library could be considered a working example of this. In the realm of ANNs, per recent OpenAI results with CLIP and DALL-E, really big transformers also seem to have strong compositional abilities, with the caveat that they need to be trained on segments of the whole web. (This wouldn't be an issue here, as DreamCoder generates a lot of its own training data via dreams.) Despite the data-inefficiency of DNNs / transformers, they should be sufficient for making something in the spirit of the above work, with a lot of compute, at least until more efficient models are available (which they should be shortly; see AlphaZero vs MuZero).

{1553}
ref: -2020 tags: excitatory inhibitory balance E-I synapses date: 10-06-2021 17:50 gmt revision:1 [0] [head]

Whole-Neuron Synaptic Mapping Reveals Spatially Precise Excitatory/Inhibitory Balance Limiting Dendritic and Somatic Spiking

We mapped over 90,000 E and I synapses across twelve L2/3 PNs and uncovered structured organization of E and I synapses across dendritic domains as well as within individual dendritic segments. Despite significant domain-specific variation in the absolute density of E and I synapses, their ratio is strikingly balanced locally across dendritic segments. Computational modeling indicates that this spatially precise E/I balance dampens dendritic voltage fluctuations and strongly impacts neuronal firing output.

I would think this result is tenuous, but they did do patch-clamp recordings to back it up, and it's vitally interesting from a structural standpoint. Plus, it's an enjoyable, well-written paper :-)

{1544}
ref: -2019 tags: HSIC information bottleneck deep learning backprop gaussian kernel date: 10-06-2021 17:23 gmt revision:5 [4] [3] [2] [1] [0] [head]

The HSIC Bottleneck: Deep learning without Back-propagation

In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure.

The information bottleneck was proposed by Tishby, Pereira, and Bialek in 1999, and aims to increase the mutual information between the layer representation and the labels while minimizing the mutual information between the representation and the input:

$\min_{P_{T_i | X}} I(X; T_i) - \beta I(T_i; Y)$

Where $T_i$ is the hidden representation at layer i (later, the output), $X$ is the layer input, and $Y$ are the labels. By replacing $I()$ with the HSIC, and some derivation (?), they show that

$HSIC(D) = (m-1)^{-2} tr(K_X H K_Y H)$

Where $D = \{(x_1, y_1), ..., (x_m, y_m)\}$ are the samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$ -- that is, the kernel function applied to all pairs of (vectorial) input variables. $H$ is the centering matrix. The kernel is simply a Gaussian, $k(x,y) = exp(-\frac{1}{2} ||x-y||^2 / \sigma^2)$. So, if all the x and y are on average independent, then the inner product will be mean zero, the kernel will be mean one, and after centering will lead to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and negative, I think). In practice they use three different widths for their kernel, and they also center the kernel matrices.
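
As a concrete rendering of the estimator above (a numpy sketch with my own variable names, not the authors' code):

import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Biased HSIC estimate: (m-1)^-2 tr(K_X H K_Y H)."""
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    Kx = gaussian_gram(X, sigma)
    Ky = gaussian_gram(Y, sigma)
    return np.trace(Kx @ H @ Ky @ H) / (m - 1) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
Y_dep = X[:, :1] + 0.1 * rng.standard_normal((500, 1))   # depends on X -> large HSIC
Y_ind = rng.standard_normal((500, 1))                    # independent of X -> near-zero HSIC
print(hsic(X, Y_dep), ">>", hsic(X, Y_ind))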

But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this...

For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks albeit in a much less intelligible way.


Robust Learning with the Hilbert-Schmidt Independence Criterion

Is another, later, paper using the HSIC. Their interpretation: "This loss-function encourages learning models where the distribution of the residuals between the label and the model prediction is statistically independent of the distribution of the instances themselves." Hence, given the above nomenclature, $E_X( P_{T_i | X} I(X; T_i) ) = 0$ (I'm not totally sure about the weighting, but it might be required given the definition of the HSIC.)

As I understand it, the HSIC loss is a kernelized loss between the input, output, and labels that encourages a degree of invariance to the input ('covariate shift'). This is useful, but I'm unconvinced that making the layer output independent of the input is absolutely essential (??)

{1552}
ref: -2020 tags: Principe modular deep learning kernel trick MNIST CIFAR date: 10-06-2021 16:54 gmt revision:2 [1] [0] [head]

Modularizing Deep Learning via Pairwise Learning With Kernels

  • Shiyu Duan, Shujian Yu, Jose Principe
  • The central idea here is to re-interpret deep networks, not with the nonlinearity as the output of a layer, but rather as the input of the layer, with the regression (weights) being performed on this nonlinear projection.
  • In this sense, each re-defined layer is implementing the 'kernel trick': tasks (like classification) which are difficult in linear spaces, become easier when projected into some sort of kernel space.
    • The kernel allows pairwise comparisons of datapoints. E.g. a radial basis kernel measures the radial / Gaussian distance between data points. An SVM is a kernel machine in this sense.
      • A natural extension (one that the authors have considered) is to take non-pointwise or non-one-to-one kernel functions -- those that e.g. multiply multiple layer outputs. This is of course part of standard kernel machines.
  • Because you are comparing projected datapoints, it's natural to take a contrastive loss on each layer, tuning the weights to maximize the distance / discrimination between different classes (see the sketch after this list).
    • Hence this is semi-supervised contrastive classification, something that is quite popular these days.
    • The last layer is tuned with a cross-entropy loss on the labels, but only a few labels are required since the data is well distributed already.
  • Demonstrated on small-ish datasets, concordant with their computational resources ...
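
A minimal sketch of the per-layer pairwise idea (a generic Gaussian-kernel contrastive objective on a single re-defined layer; this is my construction, not the authors' exact loss, and it only evaluates the objective rather than training anything):

import numpy as np

rng = np.random.default_rng(0)

def layer(x, W):
    """One re-interpreted layer: nonlinearity first, then linear projection."""
    return np.tanh(x) @ W

def pairwise_gaussian(Z, sigma=1.0):
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-d2 / (2.0 * sigma**2))

def contrastive_objective(Z, labels, sigma=1.0):
    """Mean same-class similarity minus mean different-class similarity."""
    K = pairwise_gaussian(Z, sigma)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    diff = ~same
    np.fill_diagonal(diff, False)
    return K[same].mean() - K[diff].mean()

# Toy data: two Gaussian blobs.
X = np.vstack([rng.standard_normal((50, 4)) + 2, rng.standard_normal((50, 4)) - 2])
y = np.array([0] * 50 + [1] * 50)
W = rng.standard_normal((4, 8)) * 0.1
print("objective to maximize:", contrastive_objective(layer(X, W), y))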

I think in general this is an important result, even if it's not wholly unique / somewhat anticipated (it's a year old at the time of writing). Modular training of neural networks is great for efficiency, parallelization, and biological implementations! Transport of weights between layers is hence non-essential.

Classes still are, but I wonder if temporal continuity can solve some of these problems?

(There is plenty of other effort in this area -- see also {1544})

{1551}
ref: -2014 tags: CNiFER Kleinfeld dopamine norepinephrine monoamine cell sensor date: 10-04-2021 14:50 gmt revision:2 [1] [0] [head]

Cell-based reporters reveal in vivo dynamics of dopamine and norepinephrine release in murine cortex

  • CNiFERs are clonal cell lines engineered to express a specific GPCR that is coupled to the Gq pathway and triggers an increase in intracellular calcium concentration, [Ca2+], which in turn is rapidly detected by a genetically encoded fluorescence resonance energy transfer (FRET)-based Ca2+ sensor. This system transforms neurotransmitter receptor binding into a change in fluorescence and provides a direct and real-time optical readout of local neurotransmitter activity. Furthermore, by using the natural receptor for a given transmitter, CNiFERs gain the chemical specificity and temporal dynamics present in vivo.
    • Clonal cell line = HEK293.
      • Human cells implanted into mice!
    • Gq pathway = through the phospholipase C-inositol trisphosphate (PLC-IP3) pathway.
  • Dopamine sensor required the engineering of a chimeric Gqi5 protein for coupling to PLC. This was a 5-AA substitution (only!)

Referenced -- and used by the recent paper Reinforcement learning links spontaneous cortical dopamine impulses to reward, which showed that dopamine signaling itself can come under volitional, operant-conditioning (or reinforcement type) modulation.

{1550}
ref: -2011 tags: government polyicy observability submerged state America date: 09-23-2021 22:06 gmt revision:0 [head]

The Submerged State -- How Invisible Government Policies Undermine American Democracy. By Suzanne Mettler

(I've not read this book, just the blurb, but it looks like a defensible thesis): Government policy, rather than distributing resources (money, infrastructure, services) as directly as possible to voters, has recently opted to distribute them indirectly, through private companies. This gives the market & private organizations more perceived clout, perpetuates a level of corruption, and undermines Americans' faith in their government.

So, we need a better 'debugger' for policy in America? Something like a discrete chain rule to help people figure out what policies (and who) are responsible for the good / bad things in their life? Sure seems that the bureaucracy could use some cleanup / is failing under burgeoning complexity. This is probably not dissimilar to cruddy technical systems.

{1549}
ref: -0 tags: gtk.css scrollbar resize linux qt5 date: 09-16-2021 15:18 gmt revision:2 [1] [0] [head]

Put this in ~/.config/gtk-3.0/gtk.css to make scrollbars larger on high-DPI screens. ref

.scrollbar {
  -GtkScrollbar-has-backward-stepper: 1;
  -GtkScrollbar-has-forward-stepper: 1;
  -GtkRange-slider-width: 16;
  -GtkRange-stepper-size: 16;
}
scrollbar slider {
    /* Size of the slider */
    min-width: 16px;
    min-height: 16px;
    border-radius: 16px;

    /* Padding around the slider */
    border: 2px solid transparent;
}

.scrollbar.vertical slider,
scrollbar.vertical slider {
    min-height: 16px;
    min-width: 16px;
}

.scrollbar.horizontal.slider,
scrollbar.horizontal slider {
    min-width: 16px;
    min-height: 16px;
}

/* Scrollbar trough squeezes when cursor hovers over it. Disabling that
 */

.scrollbar.vertical:hover:dir(ltr),
.scrollbar.vertical.dragging:dir(ltr) {
    margin-left: 0px;
}

.scrollbar.vertical:hover:dir(rtl),
.scrollbar.vertical.dragging:dir(rtl) {
    margin-right: 0px;
}

.scrollbar.horizontal:hover,
.scrollbar.horizontal.dragging,
.scrollbar.horizontal.slider:hover,
.scrollbar.horizontal.slider.dragging {
    margin-top: 0px;
}
undershoot.top, undershoot.right, undershoot.bottom, undershoot.left { background-image: none; }

To make the scrollbars a bit easier to see in QT5 applications, run qt5ct (after apt-getting it), and add in a new style sheet, /usr/share/qt5ct/qss/scrollbar-simple-backup.qss

/* SCROLLBARS (NOTE: Changing 1 subcontrol means you have to change all of them)*/
QScrollBar{
  background: palette(alternate-base);
}
QScrollBar:horizontal{
  margin: 0px 0px 0px 0px;
}
QScrollBar:vertical{
  margin: 0px 0px 0px 0px;
}
QScrollBar::handle{
  background: #816891;
  border: 1px solid transparent;
  border-radius: 1px;
}
QScrollBar::handle:hover, QScrollBar::add-line:hover, QScrollBar::sub-line:hover{
  background: palette(highlight);
}
QScrollBar::add-line{
  subcontrol-origin: none;
}
QScrollBar::add-line:vertical, QScrollBar::sub-line:vertical{
  height: 0px;
}
QScrollBar::add-line:horizontal, QScrollBar::sub-line:horizontal{
  width: 0px;
}
QScrollBar::sub-line{
  subcontrol-origin: none;
}

{1548}
ref: -2021 tags: gated multi layer perceptrons transformers ML Quoc_Le Google_Brain date: 08-05-2021 06:00 gmt revision:4 [3] [2] [1] [0] [head]

Pay attention to MLPs

  • Using bilinear / multiplicative gating + deep / wide networks, you can attain similar accuracies as Transformers on vision and masked language learning tasks! No attention needed, just an in-network multiplicative term.
  • And the math is quite straightforward. Per layer (see the sketch after this list):
    • $Z = \sigma(X U)$, $\hat{Z} = s(Z)$, $Y = \hat{Z} V$
      • Where X is the layer input, $\sigma$ is the nonlinearity (GeLU), U is a weight matrix, $\hat{Z}$ is the spatially-gated Z, and V is another weight matrix.
    • $s(Z) = Z_1 \odot (W Z_2 + b)$
      • Where Z is divided into two parts, $Z_1, Z_2$, along the channel dimension. $\odot$ is element-wise multiplication, and W is a weight matrix applied along the sequence (token) dimension.
  • You of course need a lot of compute; this paper has nice figures of model accuracy scaling vs. depth / number of parameters / size. I guess you can do this if you're Google.
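
A shape-only numpy sketch of the block above (dimensions are arbitrary, there is no training, and the near-zero W / unit-bias initialization is, as far as I recall, the paper's suggestion for making the gate start near identity):

import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gmlp_block(X, U, V, W, b):
    """One gMLP layer on a sequence X of shape (n_tokens, d_model):
    Z = gelu(X U); Z is split channel-wise into (Z1, Z2);
    s(Z) = Z1 * (W Z2 + b) gates across the token dimension; output Y = s(Z) V."""
    Z = gelu(X @ U)                       # (n, d_ffn)
    Z1, Z2 = np.split(Z, 2, axis=-1)      # each (n, d_ffn/2)
    gate = W @ Z2 + b                     # W mixes tokens: (n, n) @ (n, d_ffn/2)
    return (Z1 * gate) @ V                # (n, d_model)

n, d_model, d_ffn = 16, 32, 64
X = rng.standard_normal((n, d_model))
U = rng.standard_normal((d_model, d_ffn)) * 0.1
V = rng.standard_normal((d_ffn // 2, d_model)) * 0.1
W = np.zeros((n, n))                      # spatial weights start near zero ...
b = np.ones((n, 1))                       # ... and the bias at one, so the gate starts near identity
print(gmlp_block(X, U, V, W, b).shape)    # (16, 32)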

Pretty remarkable that an industrial lab freely publishes results like this. I guess the ROI is that they get the resultant improved ideas? Or, perhaps, Google is in such a dominant position in terms of data and compute that even if they give away ideas and code, provided some of the resultant innovation returns to them, they win. The return includes trained people as well as ideas. Good for us, I guess!

{1547}
ref: -2018 tags: luke metz meta learning google brain sgd model mnist Hebbian date: 08-05-2021 01:07 gmt revision:2 [1] [0] [head]

Meta-Learning Update Rules for Unsupervised Representation Learning

  • Central idea: meta-train a training-network (a MLP) which trains a task-network (also a MLP) to do unsupervised learning on one dataset.
  • The training network is optimized through SGD based on small-shot linear learning on a test set, typically different from the unsupervised training set.
  • The training-network is a per-weight MLP which takes in the layer input, layer output, and a synthetic error (denoted $\eta$), and generates a and b, which are then fed into an outer-product Hebbian learning rule (see the sketch after this list).
  • $\eta$ itself is formed through a backward pass through weights $V$, which affords something like backprop -- but not exactly backprop, of course. See the figure.
  • Training consists of building up very large, backward through time gradient estimates relative to the parameters of the training-network. (And there are a lot!)
  • Trained on CIFAR10, MNIST, FashionMNIST, IMDB sentiment prediction. All have their input permuted to keep the training-network from learning per-task weights. Instead the network should learn to interpret the statistics between datapoints.
  • Indeed, it does this -- albeit with limits. Performance is OK, but only if you only do supervised learning on the very limited dataset used in the meta-optimization.
    • In practice, it's possible to completely solve tasks like MNIST with supervised learning; this gets to about 80% accuracy.
  • Images were kept small -- about 20x20 -- to speed up the inner loop unsupervised learning. Still, this took on the order of 200 hours across ~500 TPUs.
  • See, as a comparison, Keren's paper, Meta-learning biologically plausible semi-supervised update rules. It's conceptually nice but only evaluates the two-moons and two-gaussian datasets.
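
A loose sketch of one such update step (the tiny per-weight networks here are random placeholders standing in for the meta-learned parameters; the shapes and architecture are my guesses, not the paper's):

import numpy as np

rng = np.random.default_rng(0)
tanh = np.tanh

def mlp(z, params):
    """Tiny shared two-layer MLP; `params` is a (W1, b1, W2, b2) tuple."""
    W1, b1, W2, b2 = params
    return tanh(z @ W1 + b1) @ W2 + b2

# Meta-parameters (in the paper these are what gets meta-trained; here they are random placeholders).
post_net = (rng.standard_normal((2, 8)) * 0.5, np.zeros(8),
            rng.standard_normal((8, 1)) * 0.5, np.zeros(1))
pre_net  = (rng.standard_normal((1, 8)) * 0.5, np.zeros(8),
            rng.standard_normal((8, 1)) * 0.5, np.zeros(1))

def hebbian_meta_update(W, x, y, eta, lr=0.01):
    """One update of task-network weights W (n_out, n_in) from layer input x, layer output y,
    and synthetic error eta, via an outer-product rule dW = a b^T."""
    a = mlp(np.stack([y, eta], axis=1), post_net)[:, 0]   # per-output term
    b = mlp(x[:, None], pre_net)[:, 0]                    # per-input term
    return W + lr * np.outer(a, b)

n_in, n_out = 6, 4
W = rng.standard_normal((n_out, n_in)) * 0.1
x = rng.standard_normal(n_in)
y = tanh(W @ x)
eta = rng.standard_normal(n_out)     # stand-in for the backward pass through V
W = hebbian_meta_update(W, x, y, eta)
print(W.shape)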

This is a clearly-written, easy to understand paper. The results are not highly compelling, but as a first set of experiments, it's successful enough.

I wonder what more constraints (fewer parameters, per the genome), more options for architecture modifications (e.g. different feedback schemes, per neurobiology), and a black-box optimization algorithm (evolution) would do?

{1449}
ref: -0 tags: sparse coding reference list olshausen field date: 08-04-2021 01:07 gmt revision:5 [4] [3] [2] [1] [0] [head]

This was compiled from searching papers which referenced Olshausen and Field 1996 PMID-8637596 Emergence of simple-cell receptive field properties by learning a sparse code for natural images.

{1546}
ref: -1992 tags: Linsker infomax Hebbian anti-hebbian linear perceptron unsupervised learning date: 08-04-2021 00:20 gmt revision:2 [1] [0] [head]

Local synaptic learning rules suffice to maximize mutual information in a linear network

  • Ralph Linsker, 1992.
  • A development upon {1545} -- this time with lateral inhibition trained through noise-contrast and anti-Hebbian plasticity.
  • {1545} does not perfectly maximize the mutual information between the input and output -- this allegedly requires the inverse of the covariance matrix, $Q$.
    • As before, infomax principles: maximize the mutual information $MI \propto H(Z) - H(Z|S)$, where Z is the network output and S is the signal input (note: minimize the conditional entropy of the output given the input).
    • For a Gaussian variable, $H = \frac{1}{2} \ln \det Q$, where Q is the covariance matrix. In this case $Q = E[Z Z^T]$.
    • Since $Z = C(S, N)$, where C are the weights, S is the signal, and N is the noise, $Q = C q C^T + r$, where q is the covariance matrix of the input noise and r is the cov. mtx. of the output noise.
    • (somewhat confusing): $\delta H / \delta C = Q^{-1} C q$
      • because .. the derivative of the determinant is complicated.
      • Check the appendix for the derivation. $\ln \det Q = Tr \ln Q$ and $dH = \frac{1}{2} d(Tr \ln Q) = \frac{1}{2} Tr(Q^{-1} dQ)$ -- this holds for positive semidefinite matrices like Q.

  • From this he comes up with a set of rules whereby feedforward weights are trained in a Hebbian fashion, but based on activity after lateral activation.
  • The lateral activation has a weight matrix $F = I - \alpha Q$ (again Q is the cov. mtx. of Z). If $y(0) = Y$ and $y(t+1) = Y + F y(t)$, where Y is the feed-forward activation, then $\alpha y(\infty) = Q^{-1} Y$. This checks out:
% Octave / Matlab check of the recursion above:
x = randn(1000, 10);
Q = x' * x;             % (unnormalized) covariance of the columns of x
a = 0.001;              % alpha; needs 0 < a*eig(Q) < 2 for the recursion to converge
Y = randn(10, 1);       % feed-forward activation
y = zeros(10, 1);
for i = 1:1000
	y = Y + (eye(10) - a*Q)*y;   % y(t+1) = Y + F*y(t), with F = I - a*Q
end

y - pinv(Q)*Y / a       % should be ~zero
  • This recursive definition is from Jacobi: $\alpha y(\infty) = \alpha \Sigma_{t=0}^{\infty} F^t Y = \alpha (I - F)^{-1} Y = Q^{-1} Y$.
  • Still, you need to estimate Q through a running average, $\Delta Q_{nm} = \frac{1}{M}( Y_n Y_m + r_{nm} - Q_{nm} )$, and since $F = I - \alpha Q$, F is formed via anti-Hebbian terms.

To this is added a 'sensing' learning phase and a 'noise' unlearning phase -- one optimizes $H(Z)$, the other minimizes $H(Z|S)$. Everything is then applied, similar to before, to Gaussian-filtered one-dimensional white-noise stimuli. He shows this results in bandpass filter behavior -- quite weak sauce in an era where ML papers are expected to test on five or so datasets. Even if this was 1992 (nearly thirty years ago!), it would have been nice to see this applied to a more realistic dataset; perhaps some of the following papers? Olshausen & Field came out in 1996 -- but they applied their algorithm to real images.

In both Olshausen & this work, no affordances are made for multiple layers. There have to be solutions out there...

{1545}
ref: -1988 tags: Linsker infomax linear neural network hebbian learning unsupervised date: 08-03-2021 06:12 gmt revision:2 [1] [0] [head]

Self-organization in a perceptual network

  • Ralph Linsker, 1988.
  • One of the first (verbose, slightly diffuse) investigations of the properties of linear projection neurons (e.g. dot-product; no non-linearity) to express useful tuning functions.
  • 'Useful' here means information-preserving, in the face of noise or dimensional bottlenecks (like PCA).
  • Starts with Hebbian learning functions, and shows that with this + white-noise sensory input + some local topology, you can get simple and complex visual cell responses.
    • Ralph notes that neurons in primate visual cortex are tuned in utero -- prior to any real-world visual experience! Wow. (Who did these studies?)
    • This is a very minimalistic starting point; there aren't even structured stimuli (!)
    • The single neuron (and later, multiple neurons) is purely feed-forward; the author cautions that the lack of feedback is not biologically realistic.
      • Also note that this was back in the Motorola 680x0 days ... computers were not that powerful (but certainly could handle more than 1-2 neurons!)
  • Linear algebra shows that Hebbian synapses cause a linear layer to learn the covariance function of their inputs, $Q$, with no dependence on the actual layer activity (see the numerical check after this list).
  • When looked at in terms of an energy function, this is equivalent to gradient descent to maximize the layer-output variance.
  • He also hits on:
    • Hopfield networks,
    • PCA,
    • Oja's constrained Hebbian rule $\delta w_i \propto \langle L_2 (L_1 - L_2 w_i) \rangle$ (that is, a quadratic constraint on the weights to make $\Sigma w^2 \sim 1$)
    • Optimal linear reconstruction in the presence of noise
    • Mutual information between layer input and output (I found this to be a bit hand-wavey)
      • Yet he notes critically: "but it is not true that maximum information rate and maximum activity variance coincide when the probability distribution of signals is arbitrary".
        • Indeed. The world is characterized by very non-Gaussian structured sensory stimuli.
    • Redundancy and diversity in 2-neuron coding model.
    • Role of infomax in maximizing the determinant of the weight matrix, sorta.
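
A quick numerical check of the covariance point above (my construction, not Linsker's): averaged over the input ensemble, the plain Hebbian update of a linear unit depends on the inputs only through their covariance Q.

import numpy as np

rng = np.random.default_rng(0)

# Correlated Gaussian inputs with covariance Q.
d = 5
A = rng.standard_normal((d, d))
Q = A @ A.T / d
X = rng.multivariate_normal(np.zeros(d), Q, size=200_000)

w = rng.standard_normal(d)

# Plain Hebbian update for a linear unit y = w.x, averaged over the ensemble:
# <dw> = <x y> = <x x^T> w = Q w  -- the learning dynamics see only the covariance.
dw_empirical = (X * (X @ w)[:, None]).mean(axis=0)
print(np.allclose(dw_empirical, Q @ w, atol=0.02))   # True, up to sampling noise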

One may critically challenge the infomax idea: we very much need to (and do) throw away spurious or irrelevant information in our sensory streams; what upper layers 'care about' when making decisions is certainly relevant to the lower layers. This credit-assignment is neatly solved by backprop, and there are a number of 'biologically plausible' means of performing it, but both this and infomax are maybe avoiding the problem. What might the upper layers really care about? Likely 'care about' is an emergent property of the interacting local learning rules and network structure. Can you search directly in these domains, within biological limits, and motivated by statistical reality, to find unsupervised-learning networks?

You'll still need a way to rank the networks, hence an objective 'care about' function. Sigh. Either way, I don't per se put a lot of weight in the infomax principle. It could be useful, but is only part of the story. Otherwise Linsker's discussion is accessible, lucid, and prescient.

Lol.

{1543}
ref: -2019 tags: backprop neural networks deep learning coordinate descent alternating minimization date: 07-21-2021 03:07 gmt revision:1 [0] [head]

Beyond Backprop: Online Alternating Minimization with Auxiliary Variables

  • This paper is sort-of interesting: rather than back-propagating the errors, you optimize auxiliary variables -- pre-nonlinearity 'codes' -- in a last-to-first layer order. The optimization is done to minimize a multinomial logistic loss function; the math is not worked out for other loss functions, but presumably this is not a fundamental limit. The loss function also includes a quadratic term on the weights.
  • After the 'codes' are set, optimization can proceed in parallel on the weights. This is done with either straight SGD or adaptive ADAM. (A toy sketch of the alternating scheme follows this list.)
  • Weight L2 penalty is scheduled over time.
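
A toy sketch of the alternating scheme (my own simplification: MSE loss, one hidden layer, and closed-form ridge updates for the weights in place of the paper's SGD / ADAM):

import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def ridge(A, B, lam=1e-2):
    """Closed-form solution of min ||A W - B||^2 + lam ||W||^2 (the quadratic weight term)."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ B)

# Toy regression data.
X = rng.standard_normal((200, 5))
Y = np.tanh(X @ rng.standard_normal((5, 3))) @ rng.standard_normal((3, 3))

# Model: Y ~ relu(X W1) W2, with auxiliary pre-nonlinearity codes Z ~ X W1.
W1 = 0.1 * rng.standard_normal((5, 8))
W2 = 0.1 * rng.standard_normal((8, 3))
Z = X @ W1           # initialize the codes with a forward pass
mu = 1.0             # coupling between the codes and the layer below

def mse():
    return np.mean((relu(X @ W1) @ W2 - Y) ** 2)

print("initial MSE:", mse())
for it in range(200):
    # (1) Update the codes (last-to-first in a deeper net), holding weights fixed.
    for _ in range(10):
        grad_Z = ((relu(Z) @ W2 - Y) @ W2.T) * (Z > 0) + mu * (Z - X @ W1)
        Z -= 0.01 * grad_Z
    # (2) With the codes fixed, each layer is an independent least-squares problem,
    #     so the weight updates could run in parallel.
    W1 = ridge(X, Z)
    W2 = ridge(relu(Z), Y)
print("final MSE:", mse())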

This is interesting in that the weight updates can be done in parallel -- perhaps more efficiently -- but you are still propagating errors backward, albeit via optimizing 'codes'. Given the vast infrastructure devoted to auto-diff + backprop, I can't see this being adopted broadly.

That said, the idea of alternating minimization (which is used eg for EM clustering) is powerful, and this paper does describe (though I didn't read it) how there are guarantees on the convexity of the alternating minimization. Likewise, the authors show how to improve the performance of the online / minibatch algorithm by keeping around memory variables, in the form of covariance matrices.

{1542}
ref: -0 tags: gpu burn stress test github cuda date: 07-13-2021 21:32 gmt revision:0 [head]

https://github.com/wilicc/gpu-burn Multi-GPU stress test.

Are your GPUs overclocked to the point of overheating / being unreliable?

{1541}
ref: -0 tags: machine learning blog date: 04-22-2021 15:43 gmt revision:0 [head]

Paper notes by Vitaly Kurin

Like this blog but 100% better!

{1540}
ref: -2020 tags: feedback alignment local hebbian learning rules stanford date: 04-22-2021 03:26 gmt revision:0 [head]

Two Routes to Scalable Credit Assignment without Weight Symmetry

This paper looks at five different learning rules, three purely local, and two non-local, to see if they can work as well as backprop in training a deep convolutional net on ImageNet. The local learning networks all feature forward weights W and backward weights B; the forward weights (+ nonlinearities) pass the information to lead to a classification; the backward weights pass the error, which is used to locally adjust the forward weights.

Hence, each fake neuron has local access to the forward activation, the backward error (or loss gradient), the forward weight, the backward weight, and Hebbian terms thereof (e.g. the outer product of the in-out vectors for both forward and backward passes). From these available variables, they construct the local learning rules:

  • Decay (exponentially decay the backward weights)
  • Amp (Hebbian learning)
  • Null (decay based on the product of the weight and local activation; this effects a Euclidean norm on reconstruction.)

Each of these serves as a "regularizer term" on the feedback weights, which governs their learning dynamics. In the case of backprop, the backward weights B are just the instantaneous transpose of the forward weights W. A good local learning rule approximates this transpose progressively. They show that, with proper hyperparameter setting, this does indeed work nearly as well as backprop when training a ResNet-18 network.

But these hyperparameter settings don't translate to other network topologies. To fix this, they add in non-local learning rules:

  • Sparse (penalizes the Euclidean norm of the previous layer; gradient is the outer product of the (current layer activation & transpose) * B)
  • Self (directly measures the forward weights and uses them to update the backward weights)

In "Symmetric Alignment", the Self and Decay rules are employed. This is similar to backprop (the backward weights will track the forward ones) with L2 regularization, which is not new. It performs very similarly to backprop. In "Activation Alignment", Amp and Sparse rules are employed. I assume this is supposed to be more biologically plausible -- the Hebbian term can track the forward weights, while the Sparse rule regularizes and stabilizes the learning, such that overall dynamics allow the gradient to flow even if W and B aren't transposes of each other.

Surprisingly, they find Symmetric Alignment to be more robust to the injection of Gaussian noise during training than backprop. Both SA and AA achieve similar accuracies on the ResNet benchmark. The authors then go on to explain the plausibility of non-local but approximate learning rules with regression discontinuity design, a la Spiking allows neurons to estimate their causal effect.


This is a decent paper, reasonably well written. They thought through what variables are available to affect learning, and parameterized five combinations that work. Could they have done the full matrix of combinations, optimized just the same as the metaparameters? Perhaps, but that would be even more work ...

Regarding the desire to reconcile backprop and biology, this paper does not bring us much (if at all) closer. Biological neural networks have specific and local uses for error; even invoking 'error' has limited explanatory power on activity -- learning and firing dynamics, of course. Is the brain then just an overbearing mess of details and overlapping rules? Yes, probably, but that doesn't mean that we humans can't find something simpler that works. The algorithms in this paper, for example, are well described by a bit of linear algebra, and yet they are performant.

{1539}
ref: -0 tags: saab EPC date: 03-22-2021 01:29 gmt revision:0 [head]

https://webautocats.com/epc/saab/sbd/ -- Online, free parts look-up for Saab cars. Useful.

{1538}
ref: -2010 tags: neural signaling rate code patch clamp barrel cortex date: 03-18-2021 18:41 gmt revision:0 [head]

PMID-20596024 Sensitivity to perturbations in vivo implies high noise and suggests rate coding in cortex

  • How did I not know of this paper before.
  • Solid study showing that, while a single spike can elicit 28 spikes in post-synaptic neurons, the associated level of noise is indistinguishable from intrinsic noise.
  • Hence the cortex should communicate / compute in rate codes or large synchronized burst firing.
    • They found large bursts to be infrequent, timing precision to be low, hence rate codes.
    • Of course other examples, e.g. auditory cortex, exist.

Cortical reliability amid noise and chaos

  • Noise is primarily of synaptic origin. (Dropout)
  • Recurrent cortical connectivity supports sensitivity to precise timing of thalamocortical inputs.

{1537}
ref: -0 tags: cortical computation learning predictive coding reviews date: 02-23-2021 20:15 gmt revision:2 [1] [0] [head]

PMID-30359606 Predictive Processing: A Canonical Cortical Computation

  • Georg B Keller, Thomas D Mrsic-Flogel
  • Their model relies on two error signals, positive and negative, for reconciling the sensory experience with the top-down predictions. I haven't read the full article, and disagree that such errors are explicitly represented by dedicated neurons, but the model is plausible. Hence worth recording the paper here.

PMID-23177956 Canonical microcircuits for predictive coding

  • Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, Karl J Friston
  • We revisit the established idea that message passing among hierarchical cortical areas implements a form of Bayesian inference -- paying careful attention to the implications for intrinsic connections among neuronal populations.
  • Have these algorithms been put to practical use? I don't know...

Control of synaptic plasticity in deep cortical networks

  • Pieter R. Roelfsema & Anthony Holtmaat
  • Basically argue for a many-factor learning rule at the feedforward and feedback synapses, taking into account pre, post, attention, and reinforcement signals.
  • See comment by Tim Lillicrap and Blake Richards.