ref: -0 tags: date: 01-09-2022 19:04 gmt revision:1 [0] [head]

The Sony Xperia XZ1 compact is a better phone than an Apple iPhone 12 mini

I don't normally write any personal opinions here -- just half-finished paper notes riddled with typos (haha) -- but this one has been bothering me for a while.

In November 2020 I purchased an iPhone 12 mini to replace my aging Sony Xperia XZ1 compact. (Thinking of staying with Android, I tried out a Samsung S10e as well, but didn't like it.) Having owned and used the iPhone for a year and change, I still prefer the Sony. Here is why:

  • Touch screen
    • The iPhone is MUCH more sensitive to sweat than the Sony
    • This is the biggest problem, since I like to move (hike, bike, kayak etc), it lives in my pocket, and inevitably gets a bit of condensation or water on it.
    • The iPhone screen becomes frustrating to use with even an imperceptible bit of moisture on it.
      • Do iPhone users not sweat?
      • Frequently I can't even select the camera app! Or switch to maps!
        • A halfway fix is to turn the screen off then on again. Halfway.
    • The Sony, in comparison, is relatively robust, and works even with droplets of water on it.
  • Size
    • They are both about the same size with a case: the Sony is 129 x 65 x 9.3 mm; the iPhone mini is 131.5 x 64.2 x 7.4 mm.
    • This size is absolutely perfect and manufacturers need to make more phones with these dimensions!
    • If anything, the iPhone is better here -- the rounded corners are nice.
  • Battery
    • Hands down, the Sony. Lasts >=2x as long as the iPhone.
  • Processor
    • Both are fast enough.
  • Software
    • Apple is not an ecosystem. No. It's a walled garden where a select few plants may grow. You do what Apple wants you to do.
      • E.g. want to use any Google apps on iPhone? No problem! Want to use any Apple apps on Android or web or PC? Nope, sorry, you have to buy a $$$ MacBook Pro.
    • Ok, the privacy on an iPhone is nice. Modulo that bit where they scanned our photos.
      • As well as the ability to manage notifications & basically turn them all off :)
    • There are many more apps on Android, and they are much less restricted in what they can do.
      • For example, recently we were in the desert & wanted a map of where the cell signal was strong, for remote-working. This is easy on Android (there is an app for it).
        • This is impossible on iPhone (the apps don't have access to the information).
      • Second example, you can ssh into an Android and use that to download large files (e.g. packages, datasets) to avoid using limited tethering data.
        • This is also impossible on iPhone.
    • Why does iMessage make all texts from Android users yucky green? Why is there zero option to change this?
    • Why does iMessage send very low resolution photos to my friends and family using Android? It sends beautiful full-res photos to other Apple phones.
    • Why is there no web interface to iMessage?
      • Ugh, this iPhone is such an elitist snob.
    • You can double-tap on the square in Android to quickly switch between apps, which is great.
    • Apple noticeably auto-corrects to a smaller vocabulary than desired. Android is less invasive in this respect.
  • Cell signal
    • They are similarly unreliable, though the iPhone has 5G & many more wireless bands, which is great.
    • Still, frequently I'll have one-two bars of connectivity & yet Google Maps will say "you are offline". This is much less frequent on the Sony.
  • Screen
    • iPhone screen is better.
  • Camera
    • iPhone camera is very very much better.
  • Speaker
    • iPhone speaker much better. But it sure burns the battery.
  • Wifi
    • iPhone will periodically disconnect from Wifi when on FaceTime calls. Sony doesn't do this.
      • FaceTime only works with Apple devices.
  • Price
    • Sony wins
  • Unlock
    • Face unlock is a cool idea, but we all wear masks now.
    • The Sony has a fingerprint sensor, which is better.
      • In the case where I'm moving (and possibly sweaty), Android is smart enough to allow quick unlock, for access to the camera app or maps. Great feature.

Summary: I'll try to get my money's worth out of the iPhone; when it dies, I will buy the smallest waterproof Android phone that supports my carrier's bands.

ref: -0 tags: date: 01-09-2022 19:03 gmt revision:1 [0] [head]

Cortical response selectivity derives from strength in numbers of synapses

  • Benjamin Scholl, Connon I. Thomas, Melissa A. Ryan, Naomi Kamasawa & David Fitzpatrick
  • "Using electron microscopy reconstruction of individual synapses as a metric of strength, we find no evidence that strong synapses have a predominant role in the selectivity of cortical neuron responses to visual stimuli. Instead, selectivity appears to arise from the total number of synapses activated by different stimuli."
  • "Our results challenge the role of Hebbian mechanisms in shaping neuronal selectivity in cortical circuits, and suggest that selectivity reflects the co-activation of large populations of presynaptic neurons with similar properties and a mixture of strengths. "
    • Interesting -- so this is consistent with ANNs / feature detectors / vector hypothesis.
    • It would imply that the mapping is dense rather than sparse -- but to see this, you'd need to record the activity of all these synapses in realtime.
      • Which is possible, (e.g. lightbeads, fast axial focusing), just rather difficult for now.
  • To draw really firm conclusions, would need a thorough stimulus battery, not just drifting gratings.
    • It may change this result: "Surprisingly, the strength of individual synapses was uncorrelated with functional similarity to the somatic output (that is, absolute orientation preference difference)"

ref: work-0 tags: distilling free-form natural laws from experimental data Schmidt Cornell automatic programming genetic algorithms date: 12-30-2021 05:11 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

Distilling free-form natural laws from experimental data

  • The critical step was to use the full set of all pairs of partial derivatives ( $\delta x / \delta y$ ) to evaluate the search for invariants.
  • The selection of which partial derivatives are held to be independent / which variables are dependent is a bit of a trick too -- see the supplemental information.
    • Even so, with a 4D data set the search for natural laws took ~ 30 hours.
  • This was via a genetic algorithm, distributed among 'islands' on different CPUs, with mutation and single-point crossover.
  • Not sure what the IL is, but it appears to be floating-point assembly.
  • Timeseries data is smoothed with Loess smoothing, which fits a polynomial to the data, and hence allows for smoother / more analytic derivative calculation.
    • Then again, how long did it take humans to figure out these invariants? (Went about it in a decidedly different way..)
    • Further, how long did it take for biology to discover similar 'design equations'?
      • The same algorithm has been applied to biological data - a metabolic pathway - with some success (published 2011).
      • Of course evolution had to explore a much larger space - proteins and regulatory pathways, not simpler mathematical expressions / linkages.
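
The partial-derivative-pair idea above can be sketched on a toy system. This is only loosely in the spirit of the paper, not its actual fitness function: the harmonic oscillator data, the `pair_error` helper, and the two hand-written candidates are all assumptions made for illustration. The ratio of numerical time derivatives estimates dx/dv from the data; a candidate invariant f(x, v) = const implies dx/dv = -f_v / f_x by implicit differentiation; a good candidate makes the two agree.

```python
import numpy as np

# Toy data: a unit harmonic oscillator sampled in time (assumed example system).
t = np.linspace(0.0, 10.0, 2001)
x = np.cos(t)
v = -np.sin(t)

# Numerical time derivatives; their ratio estimates dx/dv along the trajectory.
dx_dt = np.gradient(x, t)
dv_dt = np.gradient(v, t)

def pair_error(df_dx, df_dv):
    """Mean log-error between the data's dx/dv and the candidate's implied dx/dv."""
    fx, fv = df_dx(x, v), df_dv(x, v)
    # Skip points where either ratio is ill-conditioned.
    ok = (np.abs(dv_dt) > 0.1) & (np.abs(fx) > 0.1)
    from_data = dx_dt[ok] / dv_dt[ok]
    from_candidate = -fv[ok] / fx[ok]  # implicit differentiation of f(x, v) = const
    return float(np.mean(np.log1p(np.abs(from_data - from_candidate))))

# Candidate 1: f = x^2 + v^2 (the conserved energy, up to scale).
err_energy = pair_error(lambda x, v: 2 * x, lambda x, v: 2 * v)
# Candidate 2: f = x + v (not conserved).
err_linear = pair_error(lambda x, v: np.ones_like(x), lambda x, v: np.ones_like(v))

print(err_energy, err_linear)
```

The conserved quantity scores a much smaller error than the non-conserved one. A genetic algorithm would use a score like this to rank candidate expressions instead of comparing two by hand.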

Since his PhD, Michael Schmidt has gone on to found Nutonian, which produced the Eureqa software, apparently without dramatic new features other than being able to use the cloud for equation search. (Probably he improved many other detailed facets of the software.) Nutonian received $4M in seed funding, according to Crunchbase.

In 2017, Nutonian was acquired by DataRobot (for an undisclosed amount), where Michael has worked since, rising to the title of CTO.

Always interesting to follow up on the authors of these classic papers!

ref: -0 tags: SAT solver blog post date: 12-30-2021 00:29 gmt revision:0 [head]

Modern SAT solvers: fast, neat and underused (part 1 of N)

A set of posts that are worth re-reading.

ref: -2021 tags: synaptic imaging weights 2p oregon markov date: 12-29-2021 23:30 gmt revision:2 [1] [0] [head]

Distinct in vivo dynamics of excitatory synapses onto cortical pyramidal neurons and parvalbumin-positive interneurons

  • Joshua B. Melander, Aran Nayebi, Bart C. Jongbloets, Dale A. Fortin, Maozhen Qin, Surya Ganguli, Tianyi Mao, Haining Zhong
  • Cre-dependent mVenus-labeled PSD-95, in both excitatory pyramidal neurons & inhibitory PV interneurons.
  • Morphology labeled with tdTomato.
  • Longitudinal imaging of individual excitatory post-synaptic densities; estimated weight from fluorescence; examined spine appearance and disappearance.
  • PV synapses were more stable over the 24-day period than synapses on pyramidal neurons.
  • Likewise, large synapses were more likely to remain over the imaging period.
  • Both followed log-normal distributions in 'strengths'
  • Changes were well modeled by a Markov process, which puts high probability on small changes.
  • But these changes are multiplicative (+ additive component in PV cells)
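
A minimal sketch of why multiplicative Markov changes and log-normal strengths go together. This is a generic multiplicative random walk, not the model fitted in the paper; the synapse count and the jitter `sigma` are assumed values, and only the 24 steps echo the imaging period.

```python
import numpy as np

rng = np.random.default_rng(0)

n_synapses, n_days = 5000, 24  # 24 echoes the imaging period; the rest is assumed
sigma = 0.1                    # per-day multiplicative jitter (hypothetical value)
w = np.ones(n_synapses)

for _ in range(n_days):
    # Each day's weight depends only on the previous day's weight (Markov),
    # and the size of the change scales with the weight itself (multiplicative).
    w = w * np.exp(sigma * rng.standard_normal(n_synapses))

log_w = np.log(w)
# If the weights are log-normal, log(w) is Gaussian, so its skewness is near zero,
# while the raw weights are right-skewed.
skew_log = np.mean((log_w - log_w.mean()) ** 3) / log_w.std() ** 3
skew_raw = np.mean((w - w.mean()) ** 3) / w.std() ** 3
print(f"skew of log-weights: {skew_log:.2f}, skew of raw weights: {skew_raw:.2f}")
```

Small steps are the most probable ones here, as in the bullet above. A pure multiplicative walk keeps spreading over time, so a stationary distribution needs something extra, which may be where an additive component comes in.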

ref: -0 tags: diffusion models image generation OpenAI date: 12-24-2021 05:50 gmt revision:0 [head]

Some investigations into denoising models & their intellectual lineage:

Deep Unsupervised Learning using Nonequilibrium Thermodynamics 2015

  • Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
  • Starting derivation of using diffusion models for training.
  • Verrry roughly, the idea is to destroy the structure in an image using diagonal Gaussian per-pixel, and train an inverse-diffusion model to remove the noise at each step. Then start with Gaussian noise and reverse-diffuse an image.
  • Diffusion can take 100s - 1000s of steps; steps are made small to preserve the assumption that the conditional probability, $p(x_{t-1}|x_t) \propto N(0, I)$
    • The time variable here goes from 0 (uncorrupted data) to T (fully corrupted / Gaussian noise)

Generative Modeling by Estimating Gradients of the Data Distribution July 2019

  • Yang Song, Stefano Ermon

Denoising Diffusion Probabilistic Models June 2020

  • Jonathan Ho, Ajay Jain, Pieter Abbeel
  • A diffusion model that can output 'realistic' images (low FID / low log-likelihood)

Improved Denoising Diffusion Probabilistic Models Feb 2021

  • Alex Nichol, Prafulla Dhariwal
  • This is directly based on Ho 2020 and Sohl-Dickstein 2015, but with tweaks
  • The objective is no longer the log-likelihood of the data given the parameters (per pixel); it's now mostly the MSE between the corrupting noise (which is known) and the estimated noise.
  • That is, the neural network model attempts, given $x_t$, to estimate the noise which corrupted it, which can then be used to produce $x_{t-1}$
    • Simplicity. Satisfying.
  • They also include a reweighted version of the log-likelihood loss, which puts more emphasis on the first few steps of noising. These steps are more important for NLL; reweighting also smooths the loss.
    • I think that, per Ho above, the simple MSE loss is sufficient to generate good images, but the reweighted LL improves the likelihood of the parameters.
  • There are some good crunchy mathematical details on exactly how the mean and variance of the estimated Gaussian distributions are handled -- at each noising step, you need to scale the mean down to prevent Brownian / random walk.
    • Taking these further, you can estimate an image at any point t in the forward diffusion chain. They use this fact to optimize the function approximator (a neural network; more later) using a (random but re-weighted/scheduled) t and the LL loss + simple loss.
  • Ho 2020 above treats the variance of the noising Gaussian as fixed -- that is, $\beta$ ; this paper improves the likelihood by adjusting the noise variance mostly at the last steps by a $\tilde{\beta}_t$ , and then further allowing the function approximator to tune the variance (a multiplicative factor) per inverse-diffusion timestep.
    • TBH I'm still slightly foggy on how you go from estimating noise (this seems like samples, concrete) to then estimating variance (which is variational?). hmm.
  • Finally, they schedule the forward noising with a cosine^2, rather than a linear ramp. This makes the last phases of corruption more useful.
  • Because they have an explicit parameterization of the noise variance, they can run the inverse diffusion (e.g. image generation) faster -- rather than 4000 steps, which can take a few minutes on a GPU, they can step up the variance and run it only for 50 steps and get nearly as good images.
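
A rough NumPy sketch of three pieces from the notes above: the cosine^2 noise schedule, jumping directly to an arbitrary step t of the forward chain, and the simple MSE-on-noise loss. The function names, the 8x8 stand-in image and the zero-predicting placeholder "network" are my own assumptions; the learned variance and the reweighted likelihood term are left out.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal fraction alpha_bar_t for t = 0..T, cosine^2 schedule."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to step t of the forward chain; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    # The sqrt(alpha_bar) factor shrinks the mean so the total variance stays bounded.
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def simple_loss(eps_pred, eps):
    """MSE between the predicted and the true corrupting noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
T = 1000
alpha_bar = cosine_alpha_bar(T)
x0 = rng.uniform(-1, 1, size=(8, 8))  # stand-in "image"
t = int(rng.integers(1, T + 1))       # random timestep, uniform here for simplicity
x_t, eps = q_sample(x0, t, alpha_bar, rng)

# Placeholder for the U-net: a hypothetical model that always predicts zero noise.
eps_pred = np.zeros_like(x_t)
print(t, simple_loss(eps_pred, eps))
```

In training, `eps_pred` would come from the network evaluated on `(x_t, t)`, and the loss would be averaged over many images and timesteps.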

Diffusion Models Beat GANs on Image Synthesis May 2021

  • Prafulla Dhariwal, Alex Nichol

In all of the above, it seems that the inverse-diffusion function approximator is a minor player in the paper -- but of course, it's vitally important to making the system work. In some sense, this 'diffusion model' is as much a means of training the neural network as it is a (rather inefficient, compared to GANs) way of sampling from the data distribution. In Nichol & Dhariwal Feb 2021, they use a U-net convolutional network (e.g. start with few channels, downsample and double the channels until there are 128-256 channels, then upsample x2 and halve the channels) including multi-headed attention. Ho 2020 used single-headed attention only at the 16x16 level. Ho 2020 in turn was based on PixelCNN++

PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications Jan 2017

  • Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma

which is an improvement to (e.g. add self-attention layers)

Conditional Image Generation with PixelCNN Decoders

  • Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu

Most recently,

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  • Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen

Added text-conditional generation + many more parameters + much more compute to yield very impressive image results + in-painting. This last effect is enabled by the fact that it's a full generative denoising probabilistic model -- you can condition on other parts of the image!

ref: -2021 tags: hippocampal behavior scale plasticity Magee Romani Bittner date: 12-20-2021 22:39 gmt revision:0 [head]

Bidirectional synaptic plasticity rapidly modifies hippocampal representations

  • Normal Hebbian plasticity depends on pre and post synaptic activity & their time course.
  • Three-factor plasticity depends on pre, post, and neuromodulatory activity, typically formalized as an eligibility trace (ET) and instructive signal (IS).
  • Here they show that dendritic-plateau dependent hippocampal place field generation, in particular LTD, is not (quite so) dependent on post synaptic activity.
  • Instead, it appears to be a 'register update' operation, where a new pattern is remembered (through LTP) and an old pattern is forgotten (through LTD).
    • That is, the synapses are updating information, not accumulating information.
  • The eq for a single synapse: $\Delta W / \delta t = (W_{max} - W) k^+ q^+(ET * IS) - W k^- q^-(ET * IS)$
    • Where k are the learning rates, and q are the nonlinear functions regulating potentiation / depression based on eligibility trace and instructive signal.
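
To see why this rule behaves like a 'register update', here is a small sketch that integrates the equation above for a fixed ET*IS overlap. The sigmoid shapes of q+ and q-, and every parameter value, are assumptions picked for illustration, not the fitted values from the paper.

```python
import numpy as np

def sigmoid(z, slope, midpoint):
    return 1.0 / (1.0 + np.exp(-slope * (z - midpoint)))

# All parameter values below are hypothetical, chosen only for illustration.
W_MAX, K_POT, K_DEP = 1.0, 2.0, 0.5

def q_pot(overlap):  # potentiation gate: needs a large ET*IS overlap
    return sigmoid(overlap, slope=10.0, midpoint=0.5)

def q_dep(overlap):  # depression gate: already saturated at small overlap
    return sigmoid(overlap, slope=40.0, midpoint=0.05)

def dW_dt(W, overlap):
    return (W_MAX - W) * K_POT * q_pot(overlap) - W * K_DEP * q_dep(overlap)

def fixed_point(overlap):
    # Setting dW/dt = 0 gives W* = W_max * k+ q+ / (k+ q+ + k- q-).
    up, down = K_POT * q_pot(overlap), K_DEP * q_dep(overlap)
    return W_MAX * up / (up + down)

def integrate(W0, overlap, dt=0.01, steps=5000):
    W = W0
    for _ in range(steps):  # plain forward-Euler integration
        W += dt * dW_dt(W, overlap)
    return W

for overlap in (0.1, 0.9):
    print(overlap, integrate(0.2, overlap), integrate(0.8, overlap), fixed_point(overlap))
```

For a given overlap, the weight converges to the same value whatever its starting point. The old value is overwritten rather than added to, which is the "updating, not accumulating" behavior described above.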

I'm still not 100% sure that this excludes any influence on presynaptic activity ... they didn't control for that. But certainly LTD in their model does not require postsynaptic activity; indeed, it may only require net-synaptic homeostasis.

ref: -0 tags: SVD vocabulary english latent vector space Plato date: 12-20-2021 22:27 gmt revision:1 [0] [head]

A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge

  • A whole lot of verbiage here for an old, important, but relatively straightforward result:
    • Take ~30k encyclopedia articles.
    • From them, make a vocabulary of ~ 60k words.
    • Form a sparse matrix with rows being the vocabulary word, and columns being the encyclopedia article.
    • Perform large, sparse SVD on this matrix.
      • How? He doesn't say.
    • Take the top 300 singular values & associated singular vectors (the U vectors, since words are rows), and use these as an embedding space for vocabulary.
  • The 300-dim embedding can then be used to perform analysis to solve TOEFL synonym problems
    • Map the cue and the multiple choice query words to 300-dim space, and select the one with the highest cosine similarity.
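
The recipe above fits in a few lines of NumPy at toy scale. The six-document corpus, `k = 2` and the `best_synonym` helper are made up for illustration; with words as rows, the word embeddings are the rows of U scaled by the singular values.

```python
import numpy as np

# Tiny made-up corpus standing in for the ~30k encyclopedia articles.
docs = [
    "cat dog pet animal",
    "dog puppy pet",
    "cat kitten pet",
    "car truck road engine",
    "truck engine wheel road",
    "car wheel road",
]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix: rows are words, columns are documents.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

# Dense SVD is fine at toy scale; keep the top k singular directions.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                   # the paper keeps ~300
emb = U[:, :k] * S[:k]  # one k-dim vector per word

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_synonym(cue, choices):
    """Pick the choice whose embedding has the highest cosine similarity to the cue."""
    return max(choices, key=lambda c: cosine(emb[index[cue]], emb[index[c]]))

print(best_synonym("dog", ["cat", "truck", "engine"]))
```

At the real scale (~60k x ~30k, sparse) a dense SVD is wasteful; a truncated sparse solver such as `scipy.sparse.linalg.svds` would be one option today, though that is my guess and not something the paper specifies.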

The fact that SVD works at all, and pulls out some structure, is interesting! Not nearly as good as word2vec.

ref: -0 tags: concept net NLP transformers graph representation knowledge date: 11-04-2021 17:48 gmt revision:0 [head]

Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

  • From a team at University of Washington / Allen Institute for Artificial Intelligence.
  • Courtesy of Yannic Kilcher's YouTube channel.
  • General idea: use GPT-3 as a completion source given a set of prompts, like:
    • X starts running
      • So, X gets in shape
    • X and Y engage in an argument
      • So, X wants to avoid Y.
  • There are only 7 linkage atoms (edges, so to speak) in these queries, but of course many actions / direct objects.
    • These prompts are generated from the Atomic 20-20 human-authored dataset.
    • The prompts are fed into 175B parameter DaVinci model, resulting in 165k examples in the 7 linkages after cleaning.
    • In turn the 165k are fed into a smaller version of GPT-3, Curie, that generates 6.5M text examples, aka Atomic 10x.
  • Then filter the results via a second critic model, based on fine-tuned RoBERTa & human supervision to determine if a generated sentence is 'good' or not.
  • By throwing away 62% of Atomic 10x, they get a student accuracy of 96.4%, much better than the human-designed knowledge graph.
    • They suggest that one way this works is by removing degenerate outputs from GPT-3.

Human-designed knowledge graphs are described here: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge

And employed for profit here: https://www.luminoso.com/

ref: -0 tags: gtk.css scrollbar resize linux qt5 date: 10-28-2021 18:47 gmt revision:3 [2] [1] [0] [head]

Put this in ~/.config/gtk-3.0/gtk.css to make scrollbars larger on high-DPI screens. ref

.scrollbar {
  -GtkScrollbar-has-backward-stepper: 1;
  -GtkScrollbar-has-forward-stepper: 1;
  -GtkRange-slider-width: 16;
  -GtkRange-stepper-size: 16;
}

scrollbar slider {
    /* Size of the slider */
    min-width: 16px;
    min-height: 16px;
    border-radius: 16px;

    /* Padding around the slider */
    border: 2px solid transparent;
}

.scrollbar.vertical slider,
scrollbar.vertical slider {
    min-height: 16px;
    min-width: 16px;
}

scrollbar.horizontal slider {
    min-width: 16px;
    min-height: 16px;
}

/* Scrollbar trough squeezes when cursor hovers over it. Disabling that */

.scrollbar.vertical.dragging:dir(ltr) {
    margin-left: 0px;
}

.scrollbar.vertical.dragging:dir(rtl) {
    margin-right: 0px;
}

.scrollbar.horizontal.slider.dragging {
    margin-top: 0px;
}

undershoot.top, undershoot.right, undershoot.bottom, undershoot.left { background-image: none; }

Also add:

to your ~/.bashrc

To make the scrollbars a bit easier to see in QT5 applications, run qt5ct (after apt-getting it), and add in a new style sheet, /usr/share/qt5ct/qss/scrollbar-simple-backup.qss

/* SCROLLBARS (NOTE: Changing 1 subcontrol means you have to change all of them)*/
  background: palette(alternate-base);
  margin: 0px 0px 0px 0px;
  margin: 0px 0px 0px 0px;
  background: #816891;
  border: 1px solid transparent;
  border-radius: 1px;
QScrollBar::handle:hover, QScrollBar::add-line:hover, QScrollBar::sub-line:hover{
  background: palette(highlight);
}
subcontrol-origin: none;
QScrollBar::add-line:vertical, QScrollBar::sub-line:vertical{
height: 0px;
}
QScrollBar::add-line:horizontal, QScrollBar::sub-line:horizontal{
width: 0px;
}
subcontrol-origin: none;

ref: -0 tags: adaptive optics two photon microscopy date: 10-26-2021 18:17 gmt revision:1 [0] [head]

Recently I've been underwhelmed by the performance of adaptive optics (AO) for imaging head-fixed cranial-window mice. There hasn't been much of an improvement, despite significant optimization effort. This raises the question: where are AO microscopes used?

When the purpose of a paper is to explain and qualify a novel AO approach, the improvement is always good, >> 2x. Yet, in the one paper (first below) where the purpose was neuroscience, not optics, the results are less inspiring. Are the results from the optics papers cherry-picked?

Thalamus provides layer 4 of primary visual cortex with orientation- and direction-tuned inputs Wenzhi Sun, Zhongchao Tan, Brett D Mensh & Na Ji 2016 https://www.nature.com/articles/nn.4196

  • This is the primary (only?) paper where AO was used, but the focus was biology: measuring the tuning properties of thalamic boutons in mouse visual cortex. Which they did, well!
  • Surprisingly, the largest improvement was not from using AO, but rather from thinning the cranial window from 340um to 170um.
  • "With a 340-μm-thick cranial window, 70% of all boutons appeared to be non-responsive to visual stimuli and only 7% satisfied OS criteria. With a thinner cranial window of 170-μm thickness, we found that 31% of boutons satisfied OS criteria (of total n = 1,302, 5 mice), which was still substantially fewer than 48% OS boutons as determined when the same boutons (n = 1,477, 5 mice) were imaged after aberration correction by adaptive optics"

Direct wavefront sensing for high-resolution in vivo imaging in scattering tissue Kai Wang, Wenzhi Sun, Christopher T. Richie, Brandon K. Harvey, Eric Betzig & Na Ji, 2015 https://www.nature.com/articles/ncomms8276

  • Direct wavefront sensing using indocyanine green + Andor iXon 897 EMCCD Shack-Hartmann wavefront sensor (read: expensive).
  • Alpao DM97-15, basically the same as ours.
  • Fairly local wavefront corrections, see figure 2.
  • Also note that these wavefront corrections seem low-order, hence should be correctable via a DM

Multiplexed aberration measurement for deep tissue imaging in vivo Chen Wang, Rui Liu, Daniel E Milkie, Wenzhi Sun, Zhongchao Tan, Aaron Kerlin, Tsai-Wen Chen, Douglas S Kim & Na Ji 2014 https://www.nature.com/articles/nmeth.3068

  • Use a DMD (including a dispersion pre-compensator) to amplitude modulate phase ramps on a wavefront-modulating SLM. Each phase-ramp segment of the SLM was modulated at a different frequency, allowing for the optimal phase to be pulled out later through a Fourier transform.
  • Again, very good performance at depth in the mouse brain.
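
The frequency-multiplexing trick can be illustrated generically: tag each segment with its own modulation frequency, record one summed signal, and read each segment's phase off the matching Fourier bin. This is a signal-processing sketch only, not a model of the actual optics; the four frequencies, the noise level and the record length are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 1024
t = np.arange(n_samples)
# Each segment gets its own modulation frequency, given as a whole number of
# cycles per record so that it lands exactly on one FFT bin (assumed values).
freqs = np.array([11, 17, 23, 31])
true_phase = rng.uniform(-np.pi, np.pi, size=freqs.size)

# Detected signal: sum of the per-segment modulations plus a little noise.
signal = sum(np.cos(2 * np.pi * f * t / n_samples + p) for f, p in zip(freqs, true_phase))
signal = signal + 0.1 * rng.standard_normal(n_samples)

# One Fourier transform separates the segments; the angle of each bin is that
# segment's phase.
spectrum = np.fft.rfft(signal)
recovered = np.angle(spectrum[freqs])

print(np.round(true_phase, 2), np.round(recovered, 2))
```

Because the frequencies are distinct, all segments are measured in parallel from a single recording rather than one at a time.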

ref: -2021 tags: FIBSEM electron microscopy presynaptic plasticity activity Funke date: 10-12-2021 17:03 gmt revision:0 [head]

Ultrastructural readout of in vivo synaptic activity for functional connectomics

  • Anna Simon, Arnd Roth, Arlo Sheridan, Mehmet Fişek, Vincenzo Marra, Claudia Racca, Jan Funke, Kevin Staras, Michael Häusser
  • Did FIB-SEM on FM1-43 dye labeled synapses, then segmented the cells using machine learning, as Jan has pioneered.
    • FM1-43FX is membrane impermeable, and labels only synaptic vesicles that have been recycled after dye loading. (Invented in 1992!)
    • FM1-43FX is also able to photoconvert diaminobenzidine (DAB) into an amorphous, highly conjugated polymer with high affinity for osmium tetroxide.
  • This allows for a snapshot of ultrastructural presynaptic plasticity / activity.
  • N=84 boutons, but n=7 pairs / triples of boutons from the same axon.
    • These boutons have the same presynaptic spiking activity, and hence are expected to have the same release probability, and hence the same photoconversion (PC) labeling.
      • But they don't! The ratio of PC+ vesicle numbers between boutons on the same neuron is low, mean < 0.4, which suggests some boutons have high neurotransmitter release and recycling, others have low...
  • Quote in the abstract: We also demonstrate that neighboring boutons of the same axon, which share the same spiking activity, can differ greatly in their presynaptic release probability.
    • Well, sorta, the data here is a bit weak. It might all be lognormal fluctuations, as has been well demonstrated.
    • When I read it I was excited to think of the influence of presynaptic inhibition / modulation, which has not been measured here, but is likely to be important.

ref: -2020 tags: dreamcoder ellis program induction ai tenenbaum date: 10-10-2021 17:32 gmt revision:2 [1] [0] [head]

DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning

  • Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales, Luke Hewitt, Armando Solar-Lezama, Joshua B. Tenenbaum

This paper describes a system for adaptively finding programs which succinctly and accurately produce desired output. These desired outputs are provided by the user / test system, and come from a number of domains:

  • list (as in lisp) processing,
  • text editing,
  • regular expressions,
  • line graphics,
  • 2d lego block stacking,
  • symbolic regression (ish),
  • functional programming,
  • and physical laws.

Some of these domains are naturally toy-like, e.g. the text processing, but others are deeply impressive: the system was able to "re-derive" basic physical laws of vector calculus in the process of looking for S-expression forms of cheat-sheet physics equations. These advancements result from a long lineage of work, perhaps starting from the Helmholtz machine PMID-7584891 introduced by Peter Dayan, Geoff Hinton and others, where one model is trained to generate patterns given context (e.g.) while a second recognition module is trained to invert this model: derive context from the patterns. The two work simultaneously to allow model-exploration in high dimensions.

Also in the lineage is the EC2 algorithm, which most of the same authors above published in 2018. EC2 centers around the idea of "explore - compress" : explore solutions to your program induction problem during the 'wake' phase, then compress the observed programs into a library by extracting/factoring out commonalities during the 'sleep' phase. This of course is one of the core algorithms of human learning: explore options, keep track of both what worked and what didn't, search for commonalities among the options & their effects, and use these inferred laws or heuristics to further guide search and goal-setting, thereby building a buffer against the curse of dimensionality. Making the inferred laws themselves functions in a programming library allows hierarchically factoring the search task, making exploration of unbounded spaces possible. This advantage is unique to the program synthesis approach.

This much is said in the introduction, though perhaps with more clarity. DreamCoder is an improved, more-accessible version of EC2, though the underlying ideas are the same. It differs in that the method for constructing libraries has improved through the addition of a powerful version space for enumerating and evaluating refactors of the solutions generated during the wake phase. (I will admit that I don't much understand the version space system.) This version space allows DreamCoder to collapse the search space for re-factorings by many orders of magnitude, and seems to be a clear advancement. Furthermore, DreamCoder incorporates a second phase of sleep: "dreaming", hence the moniker. During dreaming the library is used to create 'dreams' consisting of combinations of the library primitives, which are then executed with training data as input. These dreams are then used to train up a neural network to predict which library and atomic objects to use in given contexts. Context in this case is where in the parse tree a given object has been inserted (its parent and which argument number it sits in); how the data-context is incorporated to make this decision is not clear to me (???).

This neural dream and replay-trained neural network is either a GRU recurrent net with 64 hidden states, or a convolutional network feeding into an RNN. The final stage is a linear ReLU (???), and again it is not clear how it feeds into the prediction of "which unit to use when". The authors clearly demonstrate that the network, or the probabilistic context-free grammar that it controls (?), is capable of straightforward optimizations, like breaking symmetries due to commutativity, avoiding adding zero, avoiding multiplying by one, etc. Beyond this, they do demonstrate via an ablation study that the presence of the neural network affords significant algorithmic leverage in all of the problem domains tested. The network also seems to learn a reasonable representation of the sub-type of task encountered -- but a thorough investigation of how it works, or how it might be made to work better, remains desired.

I've spent a little time looking around the code, which is a mix of python high-level experimental control code, and lower-level OCaml code responsible for running (emulating) the lisp-like DSL, inferring types in its polymorphic system / reconciling types in evaluated program instances, maintaining the library, and recompressing it using aforementioned version spaces. The code, like many things experimental, is clearly a work in progress, with some old or unused code scattered about, glue to run the many experiments & record / analyze the data, and personal notes from the first author for making his job talks (! :). The description in the supplemental materials, which is satisfyingly thorough (if again impenetrable wrt version spaces), is readily understandable, suggesting that one (presumably the first) author has a clear understanding of the system. It doesn't appear that much is being hidden or glossed over, which is not the case for all scientific papers.

With the caveat that I don't claim to understand the system to completion, there are some clear areas where the existing system could be augmented further. The 'recognition' or perceptual module, which guides actual synthesis of candidate programs, realistically can use as much information as is available in DreamCoder as is available: full lexical and semantic scope, full input-output specifications, type information, possibly runtime binding of variables when filling holes. This is motivated by the way that humans solve problems, at least as observed by introspection:
  • Examine problem, specification; extract patterns (via perceptual modules)
  • Compare patterns with existing library (memory) of compositionally-factored 'useful solutions' (this is identical to the library in DreamCoder)* Do something like beam-search or quasi stochastic search on selected useful solutions. This is the same as DreamCoder, however human engineers make decisions progressively, at runtime so-to-speak: you fill not one hole per cycle, but many holes. The addition of recursion to DreamCoder, provided a wider breadth of input information, could support this functionality.
  • Run the program to observe input-output .. but also observe the inner workings of the program, eg. dataflow patterns. These dataflow patterns are useful to human engineers when both debugging and when learning-by-inspection what library elements do. DreamCoder does not really have this facility.
  • Compare the current program results to the desired program output. Make a stochastic decision whether to try to fix it, or to try another beam in the search. Since this would be on a computer, this could be in parallel (as DreamCoder is); the ability to 'fix' or change a DUT is absent from DreamCoder. As a 'deeply philosophical' aside, this loop itself might be the effect of running a language-of-thought program, as was suggested by pioneers in AI (ref). The loop itself is subject to modification and replacement based on goal-seeking success in the domain of interest, in a deeply-satisfying and deeply recursive manner ...
At each stage in the pipeline, the perceptual modules would have access to relevant variables in the current problem-solving context. This is modeled on Jacques Pitrat's work. Humans of course are even more flexible than that -- context includes roughly the whole brain, and if anything we're mushy on which level of the hierarchy we are working.

Critical to making this work is to have, as I wrote in my notes many years ago, a 'self compressing and factorizing memory'. The version space magic + library could be considered a working example of this. In the realm of ANNs, per recent OpenAI results with CLIP and Dall-E, really big transformers also seem to have strong compositional abilities, with the caveat that they need to be trained on segments of the whole web. (This wouldn't be an issue here, as DreamCoder generates a lot of its own training data via dreams). Despite the data-inefficiency of DNNs / transformers, they should be sufficient for making something in the spirit of the above work, with a lot of compute, at least until more efficient models are available (which they should be shortly; see AlphaZero vs MuZero).

ref: -2020 tags: excitatory inhibitory balance E-I synapses date: 10-06-2021 17:50 gmt revision:1 [0] [head]

Whole-Neuron Synaptic Mapping Reveals Spatially Precise Excitatory/Inhibitory Balance Limiting Dendritic and Somatic Spiking

We mapped over 90,000 E and I synapses across twelve L2/3 PNs and uncovered structured organization of E and I synapses across dendritic domains as well as within individual dendritic segments. Despite significant domain-specific variation in the absolute density of E and I synapses, their ratio is strikingly balanced locally across dendritic segments. Computational modeling indicates that this spatially precise E/I balance dampens dendritic voltage fluctuations and strongly impacts neuronal firing output.

I think this would be tenuous on its own, but they did do patch-clamp recordings to back it up, and it's vitally interesting from a structural standpoint. Plus, this is an enjoyable, well-written paper :-)

ref: -2019 tags: HSIC information bottleneck deep learning backprop gaussian kernel date: 10-06-2021 17:23 gmt revision:5 [4] [3] [2] [1] [0] [head]

The HSIC Bottleneck: Deep learning without Back-propagation

In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure.

The information bottleneck was proposed by Bialek (spikes..) et al in 1999, and aims to increase the mutual information between the layer representation and the labels while minimizing the mutual information between the representation and the input:

$$\min_{P_{T_i | X}} I(X; T_i) - \beta I(T_i; Y)$$

Where $T_i$ is the hidden representation at layer i (layer output), $X$ is the layer input, and $Y$ are the labels. By replacing $I()$ with the HSIC, and some derivation (?), they show that

$$HSIC(D) = (m-1)^{-2} \, \mathrm{tr}(K_X H K_Y H)$$

Where $D = \{(x_1,y_1), \ldots, (x_m, y_m)\}$ are samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$ -- that is, it's the kernel function applied to all pairs of (vector) input variables. H is the centering matrix. The kernel is simply a Gaussian kernel, $k(x,y) = \exp(-\frac{1}{2} ||x-y||^2/\sigma^2)$. So, if all the x and y are on average independent, then the inner-product will be mean zero, the kernel will be mean one, and after centering will lead to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and positive: the estimate is the trace of a product of two positive semi-definite matrices, so it cannot be negative). In practice they use three different widths for their kernel, and they also center the kernel matrices.
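
To make the formula concrete, here is a minimal sketch of the empirical HSIC with a Gaussian kernel, in plain Python with no libraries. The function names, the single kernel width `sigma`, and the toy data are my own assumptions for illustration; this is not the authors' code, which per the note above uses several kernel widths.

```python
import math

def gaussian_kernel_matrix(points, sigma):
    """K[i][j] = exp(-0.5 * ||p_i - p_j||^2 / sigma^2) for all pairs of points."""
    def k(p, q):
        sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
        return math.exp(-0.5 * sq_dist / sigma ** 2)
    return [[k(p, q) for q in points] for p in points]

def center(K):
    """Return H K H, where H = I - (1/m) 1 1^T is the centering matrix."""
    m = len(K)
    row_mean = [sum(row) / m for row in K]
    col_mean = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    grand_mean = sum(row_mean) / m
    return [[K[i][j] - row_mean[i] - col_mean[j] + grand_mean
             for j in range(m)] for i in range(m)]

def hsic(xs, ys, sigma=1.0):
    """Empirical HSIC: (m-1)^-2 * tr(K_X H K_Y H)."""
    m = len(xs)
    Kx_centered = center(gaussian_kernel_matrix(xs, sigma))
    Ky = gaussian_kernel_matrix(ys, sigma)
    # tr(K_X H K_Y H) = tr((H K_X H) K_Y) by the cyclic property of the trace
    trace = sum(Kx_centered[i][j] * Ky[j][i]
                for i in range(m) for j in range(m))
    return trace / (m - 1) ** 2

# Toy data (hypothetical): four 1-D samples
xs = [[0.0], [1.0], [2.0], [3.0]]
dependent = hsic(xs, xs)               # y is a copy of x
independent = hsic(xs, [[5.0]] * 4)    # y is constant, carries no information
```

With constant labels every entry of K_Y is one, so the centered product sums to zero and `independent` is zero up to rounding, while `dependent` is clearly positive.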

But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this...

For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks albeit in a much less intelligible way.

Robust Learning with the Hilbert-Schmidt Independence Criterion

Is another, later, paper using the HSIC. Their interpretation: "This loss-function encourages learning models where the distribution of the residuals between the label and the model prediction is statistically independent of the distribution of the instances themselves." Hence, given above nomenclature, $E_X( P_{T_i | X} I(X ; T_i) ) = 0$ (I'm not totally sure about the weighting, but it might be required given the definition of the HSIC.)

As I understand it, the HSIC loss is a kernelized loss between the input, output, and labels that encourages a degree of invariance to input ('covariate shift'). This is useful, but I'm unconvinced that making the layer output independent of the input is absolutely essential (??)

ref: -2020 tags: Principe modular deep learning kernel trick MNIST CIFAR date: 10-06-2021 16:54 gmt revision:2 [1] [0] [head]

Modularizing Deep Learning via Pairwise Learning With Kernels

  • Shiyu Duan, Shujian Yu, Jose Principe
  • The central idea here is to re-interpret deep networks, not with the nonlinearity as the output of a layer, but rather as the input of the layer, with the regression (weights) being performed on this nonlinear projection.
  • In this sense, each re-defined layer is implementing the 'kernel trick': tasks (like classification) which are difficult in linear spaces, become easier when projected into some sort of kernel space.
    • The kernel allows pairwise comparisons of datapoints. E.g. a radial basis kernel measures the radial / Gaussian distance between data points. An SVM is a kernel machine in this sense.
      • A natural extension (one that the authors have considered) is to take non-pointwise or non-one-to-one kernel functions -- those that e.g. multiply multiple layer outputs. This is of course part of standard kernel machines.
  • Because you are comparing projected datapoints, it's natural to take contrastive loss on each layer to tune the weights to maximize the distance / discrimination between different classes.
    • Hence this is semi-supervised contrastive classification, something that is quite popular these days.
    • The last layer is tuned with cross-entropy on labels, but only a few are required since the data is well distributed already.
  • Demonstrated on small-ish datasets, concordant with their computational resources ...
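
As a rough illustration of the pairwise idea in the bullets above, the sketch below compares layer outputs with a Gaussian (radial basis) kernel and scores how well classes are separated: mean similarity of same-class pairs minus mean similarity of different-class pairs. This score, the function names, and the toy features are my assumptions for illustration, not the authors' objective; a layer trained this way would adjust its weights to push the score up.

```python
import math

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian similarity between two feature vectors (1 when identical)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-0.5 * sq_dist / sigma ** 2)

def contrastive_score(features, labels, sigma=1.0):
    """Mean same-class similarity minus mean different-class similarity.

    Assumes at least one same-class pair and one different-class pair.
    """
    same, diff = [], []
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            k = rbf_kernel(features[i], features[j], sigma)
            (same if labels[i] == labels[j] else diff).append(k)
    return sum(same) / len(same) - sum(diff) / len(diff)

# Hypothetical layer outputs for four datapoints in two classes
labels = [0, 0, 1, 1]
well_separated = contrastive_score([[0.0], [0.1], [5.0], [5.1]], labels)
mixed_up = contrastive_score([[0.0], [5.0], [0.1], [5.1]], labels)
```

Here `well_separated` comes out close to 1 and `mixed_up` is negative, so the score rewards projections that cluster each class.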

I think in general this is an important result, even if it's not wholly unique / somewhat anticipated (it's a year old at the time of writing). Modular training of neural networks is great for efficiency, parallelization, and biological implementations! Transport of weights between layers is hence non-essential.

Class labels still are essential, but I wonder if temporal continuity can solve some of these problems?

(There is plenty of other effort in this area -- see also {1544})

ref: -2014 tags: CNiFER Kleinfeld dopamine norepinephrine monoamine cell sensor date: 10-04-2021 14:50 gmt revision:2 [1] [0] [head]

Cell-based reporters reveal in vivo dynamics of dopamine and norepinephrine release in murine cortex

  • CNiFERs are clonal cell lines engineered to express a specific GPCR that is coupled to the Gq pathway and triggers an increase in intracellular calcium concentration, [Ca2+], which in turn is rapidly detected by a genetically encoded fluorescence resonance energy transfer (FRET)-based Ca2+ sensor. This system transforms neurotransmitter receptor binding into a change in fluorescence and provides a direct and real-time optical readout of local neurotransmitter activity. Furthermore, by using the natural receptor for a given transmitter, CNiFERs gain the chemical specificity and temporal dynamics present in vivo.
    • Clonal cell line = HEK293.
      • Human cells implanted into mice!
    • Gq pathway = through the phospholipase C-inositol trisphosphate (PLC-IP3) pathway.
  • Dopamine sensor required the engineering of a chimeric Gqi5 protein for coupling to PLC. This was a 5-AA substitution (only!)

Referenced -- and used by the recent paper Reinforcement learning links spontaneous cortical dopamine impulses to reward, which showed that dopamine signaling itself can come under volitional, operant-conditioning (or reinforcement type) modulation.

ref: -2011 tags: government policy observability submerged state America date: 09-23-2021 22:06 gmt revision:0 [head]

The Submerged State -- How Invisible Government Policies Undermine American Democracy. By Suzanne Mettler

(I've not read this book, just the blurb, but it looks like a defensible thesis) : Government policies, rather than distributing resources (money, infrastructure, services) as directly as possible to voters, have recently opted to distribute them indirectly, through private companies. This gives the market & private organizations more perceived clout, perpetuates a level of corruption, and undermines Americans' faith in their government.

So, we need a better 'debugger' for policy in America? Something like a discrete chain rule to help people figure out what policies (and who) are responsible for the good / bad things in their life? Sure seems that the bureaucracy could use some cleanup / is failing under burgeoning complexity. This is probably not dissimilar to cruddy technical systems.

ref: -2021 tags: gated multi layer perceptrons transformers ML Quoc_Le Google_Brain date: 08-05-2021 06:00 gmt revision:4 [3] [2] [1] [0] [head]

Pay attention to MLPs

  • Using bilinear / multiplicative gating + deep / wide networks, you can attain similar accuracies as Transformers on vision and masked language learning tasks! No attention needed, just an in-network multiplicative term.
  • And the math is quite straightforward. Per layer:
    • $Z = \sigma(X U), \quad \hat{Z} = s(Z), \quad Y = \hat{Z} V$
      • Where X is the layer input, $\sigma$ is the nonlinearity (GeLU), U is a weight matrix, $\hat{Z}$ is the spatially-gated Z, and V is another weight matrix.
    • $s(Z) = Z_1 \odot (W Z_2 + b)$
      • Where Z is divided into two parts along the channel dimension, $Z_1$ and $Z_2$. $\odot$ is element-wise multiplication, and W is a weight matrix.
  • You of course need a lot of compute; this paper has nice figures of model accuracy scaling vs. depth / number of parameters / size. I guess you can do this if you're Google.
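
The per-layer math above is short enough to write out directly. Below is a minimal sketch in plain Python (lists of lists, no libraries) that follows the equations as written in these notes and leaves out anything else a full implementation would need. The shapes are my assumptions: X is tokens x channels, W is tokens x tokens (so it mixes across positions), and b holds one bias per token. All function names are hypothetical.

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gelu(x):
    """Exact GELU nonlinearity: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def spatial_gate(Z, W, b):
    """s(Z) = Z1 (elementwise *) (W Z2 + b), with Z split along channels."""
    half = len(Z[0]) // 2
    Z1 = [row[:half] for row in Z]
    Z2 = [row[half:] for row in Z]
    WZ2 = matmul(W, Z2)  # W mixes information across tokens (rows)
    return [[z1 * (g + b[i]) for z1, g in zip(Z1[i], WZ2[i])]
            for i in range(len(Z))]

def gmlp_layer(X, U, V, W, b):
    """Z = gelu(X U); Z_hat = s(Z); Y = Z_hat V."""
    Z = [[gelu(v) for v in row] for row in matmul(X, U)]
    Z_hat = spatial_gate(Z, W, b)
    return matmul(Z_hat, V)

# Toy example (hypothetical values): 2 tokens, 2 channels, hidden width 4
X = [[1.0, -2.0], [0.5, 3.0]]
U = [[1.0, 0.0, 0.5, -1.0], [0.0, 1.0, 1.0, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [1.0, 1.0]
Y = gmlp_layer(X, U, V, W, b)
```

With W set to zero and b set to one, the gate multiplies Z1 by one, so the layer reduces to a plain MLP on the first half of the channels; the multiplicative interaction between tokens only appears once W is non-zero.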

Pretty remarkable that an industrial lab freely publishes results like this. I guess the ROI is that they get the resultant improved ideas? Or, perhaps, Google is in such a dominant position in terms of data and compute that even if they give away ideas and code, provided some of the resultant innovation returns to them, they win. The return includes trained people as well as ideas. Good for us, I guess!

ref: -2018 tags: luke metz meta learning google brain sgd model mnist Hebbian date: 08-05-2021 01:07 gmt revision:2 [1] [0] [head]

Meta-Learning Update Rules for Unsupervised Representation Learning

  • Central idea: meta-train a training-network (an MLP) which trains a task-network (also an MLP) to do unsupervised learning on one dataset.
  • The training network is optimized through SGD based on few-shot linear learning on a test set, typically different from the unsupervised training set.
  • The training-network is a per-weight MLP which takes in layer input, layer output, and a synthetic error (denoted $\eta$), and generates a and b, which are then fed into an outer-product Hebbian learning rule.
  • $\eta$ itself is formed through a backward pass through weights $V$, which affords something like backprop -- but not exactly backprop, of course. See the figure.
  • Training consists of building up very large, backward through time gradient estimates relative to the parameters of the training-network. (And there are a lot!)
  • Trained on CIFAR10, MNIST, FashionMNIST, IMDB sentiment prediction. All have their input permuted to keep the training-network from learning per-task weights. Instead the network should learn to interpret the statistics between datapoints.
  • Indeed, it does this -- albeit with limits. Performance is OK, but only if you only do supervised learning on the very limited dataset used in the meta-optimization.
    • In practice, it's possible to completely solve tasks like MNIST with supervised learning; this gets to about 80% accuracy.
  • Images were kept small -- about 20x20 -- to speed up the inner loop unsupervised learning. Still, this took on the order of 200 hours across ~500 TPUs.
  • See, as a comparison, Keren's paper, Meta-learning biologically plausible semi-supervised update rules. It's conceptually nice but only evaluates the two-moons and two-gaussian datasets.
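
The outer-product step at the end of the update rule can be sketched as below. In the paper a and b come out of the meta-learned training-network; here they are just given vectors, and the function name, the learning rate, and the choice that a indexes rows while b indexes columns are all my assumptions.

```python
def hebbian_outer_update(W, a, b, lr=0.1):
    """Return a new matrix W + lr * outer(a, b).

    W has len(a) rows and len(b) columns; the input W is not modified.
    """
    return [[W[i][j] + lr * a[i] * b[j] for j in range(len(b))]
            for i in range(len(a))]

# Toy example (hypothetical values): a 2 x 3 weight matrix starting at zero
W0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
W1 = hebbian_outer_update(W0, a=[1.0, 2.0], b=[1.0, 0.0, -1.0], lr=0.5)
```

Each weight changes in proportion to the product of its own row term and column term, which is what makes the rule local: no weight needs to know about any other weight's update.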

This is a clearly-written, easy to understand paper. The results are not highly compelling, but as a first set of experiments, it's successful enough.

I wonder what more constraints (fewer parameters, per the genome), more options for architecture modifications (e.g. different feedback schemes, per neurobiology), and a black-box optimization algorithm (evolution) would do?