m8ta
{1578}
ref: -0 tags: superposition semantic LLM anthropic scott alexander date: 11-29-2023 23:58 gmt revision:1 [0] [head]

God Help us, let's try to understand AI monosemanticity

Commentary: To some degree, superposition seems like a geometric "hack" invented in the process of optimization to squeeze a great many (largely mutually-exclusive) sparse features into a limited number of neurons. GPT-3 has a latent dimension of only 96 * 128 = 12288, and with 96 layers this is only ~1.18 M neurons (*). A fruit fly has 100k neurons (and can't speak). All communication must be through that 12288-dimensional vector, which is passed through LayerNorm many times (**), so naturally the network learns to take advantage of locally linear subspaces.

That said, the primate visual system does seem to use superposition, though not via local subspaces; instead, neurons seem to encode multiple axes somewhat linearly (e.g. global spaces: linearly combined position and class). That was a few years ago, and I suspect that newer results may contest this. The face area seems to do a good job of disentanglement, for example.

Treating everything as high-dimensional vectors is great for analogy making, like the wife - husband + king = queen example. But having fixed-size vectors for representing arbitrary-dimensioned relationships inevitably leads to compression ~= superposition. Provided those subspaces are semantically meaningful, it all works out from a generalization standpoint -- but this is then equivalent to allocating an additional axis for said relationship or attribute. Additional axes would also put less decoding burden on the downstream layers, and make optimization easier.
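
A minimal numerical sketch of the compression ~= superposition point (mine, not from the article): pack many sparse features into far fewer dimensions using random, nearly-orthogonal directions, then read them back with small interference.

import numpy as np

rng = np.random.default_rng(0)
n_features, dim = 1000, 256                     # many more features than dimensions
W = rng.standard_normal((n_features, dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # one unit direction per feature

x = np.zeros(n_features)
active = rng.choice(n_features, size=5, replace=False)
x[active] = rng.uniform(1, 2, size=5)           # only a few features active (sparse)

h = x @ W                                       # superposed 256-d representation
x_hat = h @ W.T                                 # linear readout
print(np.round(x[active], 2))                   # true sparse values
print(np.round(x_hat[active], 2))               # recovered, up to ~0.2 interference noise
print(round(float(np.abs(np.delete(x_hat, active)).mean()), 3))   # background interference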

Google has demonstrated allocation in transformers. It's also prevalent in the cortex. Trick is getting it to work!

(*) GPT-4 is unlikely to have more than an order of magnitude more 'neurons'; PaLM-540B has only 2.17 M. Given that GPT-4 is something like 3-4x larger, it should have 6-8 M neurons, which is still 3 orders of magnitude fewer than the human neocortex (never mind the cerebellum ;-)

(**) I'm of two minds on LayerNorm. PV interneurons might be seen to do something like this, but it's all local -- you don't need everything to be vector rotations. (LayerNorm effectively removes one degree of freedom, so really it's a 12287 dimensional vector)

Update: After reading https://transformer-circuits.pub/2023/monosemantic-features/index.html, I find the idea of local manifolds / local codes to be quite appealing: why not represent sparse yet conditional features using superposition?  This also expands the possibility of pseudo-hierarchical representation, which is great.

{1577}
ref: -2008 tags: sketch armando solar lezema inductive program synthesis date: 11-11-2023 20:08 gmt revision:5 [4] [3] [2] [1] [0] [head]

Sketch - Program synthesis by sketching

  • Sketch consists of a C-like language with the additional primitives: integer-valued holes ??, reorder, generator (macros where the holes can vary), assert, repeat (also a variable-hole parameterized macro), regular-expression generators for selecting substitute programming primitives, and atomics for concurrency-focused synthesis (fork and compare-and-swap).
    • The other common programming primitives are there, too - if, else, while, procedures, arrays.
      • Loops, arrays, and recursion are limited to a fixed size to constrain the search space.
  • Sketch is inductive - working solutions are derived from the skeleton provided by the programmer + counterexamples provided by a verification function.
    • This is compared to previous work, which focused on deductive synthesis: programs are derived from a constructive proof of the satisfiability of a problem.
      • This 'proved' challenging in practice -- programmers are much better versed in construction than in abstract mathematical derivation.
    • "By giving up the deductive approach, we lose correctness by construction, and we are forced to rely on an external validation procedure."
  • So they use CEGIS: CounterExample Guided Inductive Synthesis.
    • For most sketches, only a small set of counterexamples are needed to fully constrain the solution.
    • The counterexamples are fed to the inductive synthesizer so that the next candidate will be guaranteed to work correctly for all.
    • Each counterexample typically reduces the search space by ~2^15 -- a lot of information per counterexample -- so only a handful of validation calls are usually needed, even when the solution space is very large.
      • E.g. 40 iterations when the candidate space is 2^600
        • Illuminating that computers can so readily handle such large spaces with perfect fidelity!
    • Counterexamples are much easier to find than an abstract model (as in the deductive approach)
  • To do the synthesis, the sketch is expanded to a 'control': a set of variables, holes and calling contexts.
  • This control includes parameterized values that track the value of each variable, along with operators that define their relation to other variables, holes, and calling contexts.
    • The denotation function translates any expression to a function from a state to a parameterized value.
  • The program sketch is parsed (partially evaluated) to iteratively update σ, Φ, where σ is a mapping from the set of variable names to parameterized values, and Φ is the set of control functions (holes and calling contexts).
    • With control flow like while and if, the set of parameterized values σ (with calling context) and control functions Φ are unrolled (to a limit) and unioned for the two branches (if & else).
    • Similarly for procedures and generators -- iterative state σ, Φ is updated with additional calling context τ (and bounded recursion / bounded evaluation! k is a hyperparameter)
    • Assertions and "ifs" are modeled as intersections on the iterative state.
      • In this way, semantics can model the behavior of all candidate solutions simultaneously -- e.g. Φ is a set of all possible assignments to the holes.
  • (Chapter 5) In practice: control is represented in an intermediate language, itself represented as a DAG.
    • Several optimizations are applied to make the DAG simpler, e.g. structural hashing to remove common sub-trees, and constant propagation (which may be redundant with internal SAT rules).
    • There seems to be a tension between Chapter 3 (abstract denotational semantics) and Chapter 5 (concrete DAG manipulations); how exactly they relate could have been spelled out better.
      • I.e.: how do the context, parameterized values, and control functions directly lead to the DAG??
    • I infer: the DAG seems to be a representation of the data & code flow graph that makes explicit the relationships between variables, holes, and control structures. (this is never mentioned, but seems to be the meaning of 'denotation'.)
  • The denotation function (compilation or partial evaluation) only needs to be computed once -- subsequent CEGIS loop evaluations only need to update the SAT problem from the counter-examples.
  • Integers are represented via mutually-exclusive & ordered guard bits: if a bit is 1, then the hole or variable evaluates to that integer. Aka one-hot encoding. This permits translation from DAG to CNF. (A small sketch of this encoding follows the list.)
  • CEGIS starts by finding a solution = integer assignment of holes & permutations that make the SAT problem evaluate to True.
  • Based on this control vector, it then looks for a solution (value of the input variables) to the SAT problem that evaluates as False.
    • Clever re-use of the SAT solver & representation!
  • This counter-example is used to improve the SAT problem = add statements (system must evaluate to true with all input-output counter-examples) = remove hypotheses from Φ.
  • Did not delve much into the concurrency-related chapters (7-9) or the Stencil scientific computing chapters (10-11).
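
A small sketch of the one-hot guard-bit encoding, in the DIMACS convention used by SAT solvers (positive integers are variables, negative integers their negations). This is my own toy, not the thesis code:

# One-hot ("guard bit") encoding of an integer-valued hole as CNF clauses.
def one_hot_clauses(var_ids):
    clauses = [list(var_ids)]                            # at least one guard bit is true
    for i in range(len(var_ids)):
        for j in range(i + 1, len(var_ids)):
            clauses.append([-var_ids[i], -var_ids[j]])   # no two guard bits true together
    return clauses

# a hole ranging over {0, 1, 2, 3}, with guard bits numbered 1..4:
print(one_hot_clauses([1, 2, 3, 4]))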

The essential algorithm, in words:

Take the sketch, expand it to a set of parameterized variables, holes, and calling contexts. Convert these to a DAG aka (?) data-code flow graph w/ dependencies. Try to simplify the DAG, one-hot encode integers, and convert to either a conjunctive-normal-form (CNF) SAT problem for MiniSat, or to a boolean circuit for the ABC solver.

Apply MiniSat or ABC to the problem to select a set of control values = values for the holes & permutations that satisfy the boolean constraints. Using this solution, use the SAT solver to find an input variable configuration that does not satisfy the problem. This serves as a counter-example. Run this through the validator function (oracle) to see what it does; use the counter-example (inputs and outputs) to add clauses to the SAT problem. Run several times until either no counter-examples can be found or the problem is `unsat`.
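
A minimal toy of this CEGIS loop (mine, not the thesis implementation): a brute-force search over hole values stands in for MiniSat/ABC, and the spec plays the role of the validator/oracle.

# Synthesize the hole ?? in "f(x) = (x + ??) mod 16" so that f matches spec on a bounded domain.
def spec(x):                     # oracle / validator the synthesized code must match
    return (x + 3) % 16

def candidate(hole, x):          # the sketch: one integer-valued hole
    return (x + hole) % 16

def cegis(domain=range(16), hole_space=range(16)):
    counterexamples = []         # grows by one input per CEGIS round
    while True:
        # inductive synthesis: find a hole consistent with all counterexamples so far
        consistent = [h for h in hole_space
                      if all(candidate(h, x) == spec(x) for x in counterexamples)]
        if not consistent:
            return None, counterexamples           # "unsat": no hole works
        h = consistent[0]
        # verification: look for an input where this candidate fails
        cex = next((x for x in domain if candidate(h, x) != spec(x)), None)
        if cex is None:
            return h, counterexamples              # verified on the whole domain
        counterexamples.append(cex)                # refine and loop

print(cegis())                   # -> (3, [0]): one counterexample suffices here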

Though the thesis describes a system that was academic & relatively small back in 2008, Sketch has enjoyed continuous development, and remains in use. I find the work that went into it remarkable and impressive -- even with incremental improvements, you need accurate expansion of the language & manipulations to show proof-of-principle. Left wondering what limits its application to even larger problems -- the need for a higher-level loop that further subdivides / factorizes the problem, or DFS for filling out elements of the sketch?

Interesting links discovered while reading the dissertation:

  • Lock-free queues using compare_exchange_weak in C++ (useful!)
  • Karatsuba algorithm for multiplying large n-digit numbers with O(n^1.58) complexity. (Trick: divide and conquer & use algebra to re-use intermediate results via addition and subtraction)

{1576}
ref: -0 tags: GFlowNet Bengio probabilty modelling reinforcement learing date: 10-29-2023 19:17 gmt revision:3 [2] [1] [0] [head]

GFlowNet Tutorial

  • It's basically like RL, only treating the reward as a scaled unnormalized probability or 'flow'.
  • Unlike RL, GFNs are constructive -- they only add elements (actions) -- which means the resulting graph is either a DAG or a tree. (No state aliasing.)
  • Also unlike RL / REINFORCE / Actor-critic, the objective is to match forward and reverse flows, both parameterized by NNs. Hence, rather than BPTT or unrolls, information propagation is via the reverse policy model. This forward-backward difference based loss is reminiscent of self-supervised Barlow Twins, BYOL, Siamese networks, or [1][2]. Bengio even has a paper talking about it [3].
    • The fact that it works well means that it must be doing some sort of useful regularization, which is super interesting.
    • Or it just means there are N+1 ways of skinning the cat!
  • Adopting a TD(λ) approach of sampling trajectories for reward back-propagation improves convergence/generalization. Really not that different from RL..
  • At least 4 different objectives (losses):
    • Matching per-state in and out flow
    • Matching per-state forward and backward flow
    • Matching whole-trajectory forward and backward flow (trajectory balance; one common form is sketched after this list)
    • Subsampling sub-trajectories and matching their flows.
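
One common form of the whole-trajectory objective, from the trajectory-balance line of work (my paraphrase, not necessarily the tutorial's notation): for a trajectory τ = (s_0 → ... → s_n = x),

L_TB(τ) = ( log Z_θ + Σ_t log P_F(s_{t+1} | s_t; θ) − log R(x) − Σ_t log P_B(s_t | s_{t+1}; θ) )²

where Z_θ is a learned scalar (the total flow / partition function), P_F the forward policy, P_B the backward policy, and R(x) the terminal reward. Minimizing this over sampled trajectories matches forward and backward flows.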

{1562}
ref: -0 tags: SAT solver blog post date: 10-11-2023 20:24 gmt revision:1 [0] [head]

Modern SAT solvers: fast, neat and underused (part 1 of N)

A set of posts that are worth re-reading.

See also: https://www.borealisai.com/research-blogs/tutorial-11-sat-solvers-iii-factor-graphs-and-smt-solvers/

Of note: Selsam 2019 indicates that Survey propagation (Knuth 2015), an extension to Belief propagation, does not seem to extend well to hard SAT problems.

{1549}
ref: -0 tags: gtk.css scrollbar resize linux qt5 gtk-3 gtk-4 date: 08-22-2023 20:23 gmt revision:4 [3] [2] [1] [0] [head]

Put this in ~/.config/gtk-3.0/gtk.css and ~/.config/gtk-4.0/gtk.css to make scrollbars larger & permanently visible on high-DPI screens.

.scrollbar {
  -GtkScrollbar-has-backward-stepper: 1;
  -GtkScrollbar-has-forward-stepper: 1;
  -GtkRange-slider-width: 16;
  -GtkRange-stepper-size: 16;
}
scrollbar slider {
    /* Size of the slider */
    min-width: 16px;
    min-height: 16px;
    border-radius: 16px;

    /* Padding around the slider */
    border: 2px solid transparent;
}

.scrollbar.vertical slider,
scrollbar.vertical slider {
    min-height: 16px;
    min-width: 16px;
}

.scrollbar.horizontal slider,
scrollbar.horizontal slider {
    min-width: 16px;
    min-height: 16px;
}

/* Scrollbar trough squeezes when cursor hovers over it. Disabling that
 */

.scrollbar.vertical:hover:dir(ltr),
.scrollbar.vertical.dragging:dir(ltr) {
    margin-left: 0px;
}

.scrollbar.vertical:hover:dir(rtl),
.scrollbar.vertical.dragging:dir(rtl) {
    margin-right: 0px;
}

.scrollbar.horizontal:hover,
.scrollbar.horizontal.dragging,
.scrollbar.horizontal.slider:hover,
.scrollbar.horizontal.slider.dragging {
    margin-top: 0px;
}
undershoot.top, undershoot.right, undershoot.bottom, undershoot.left { background-image: none; }

Also add:

export GTK_OVERLAY_SCROLLING=0 
to your ~/.bashrc

This does not work with GTK4, though -- to do that, put the following in ~/.config/gtk-4.0/settings.ini:

[Settings]
gtk-overlay-scrolling = false

To make the scrollbars a bit easier to see in Qt5 applications, run qt5ct (after installing it via apt), and add a new style sheet, /usr/share/qt5ct/qss/scrollbar-simple-backup.qss

/* SCROLLBARS (NOTE: Changing 1 subcontrol means you have to change all of them)*/
QScrollBar{
  background: palette(alternate-base);
}
QScrollBar:horizontal{
  margin: 0px 0px 0px 0px;
}
QScrollBar:vertical{
  margin: 0px 0px 0px 0px;
}
QScrollBar::handle{
  background: #816891;
  border: 1px solid transparent;
  border-radius: 1px;
}
QScrollBar::handle:hover, QScrollBar::add-line:hover, QScrollBar::sub-line:hover{
  background: palette(highlight);
}
QScrollBar::add-line{
  subcontrol-origin: none;
}
QScrollBar::add-line:vertical, QScrollBar::sub-line:vertical{
  height: 0px;
}
QScrollBar::add-line:horizontal, QScrollBar::sub-line:horizontal{
  width: 0px;
}
QScrollBar::sub-line{
  subcontrol-origin: none;
}

{1575}
ref: -2020 tags: bessel imaging NaJi synapses SPIE date: 11-11-2022 20:14 gmt revision:0 [head]

High-contrast volumetric imaging of synapses and microcirculations with Bessel-droplet two-photon fluorescence microscopy

  • They interferometrically scan the optimized Bessel droplets along the axial dimension to image synapses volumetrically.
  • "Here, we describe axially extended Bessel-droplet foci with suppressed side rings and more resistance to optical aberrations"

{1574}
ref: -0 tags: ocaml application functional programming date: 10-11-2022 21:36 gmt revision:2 [1] [0] [head]

https://stackoverflow.com/questions/26475765/ocaml-function-with-variable-number-of-arguments

From this I learned that in OCaml you can return not just functions (e.g. currying) but applications of yet-to-be-named functions.

let sum f = f 0 ;;               (* sum : (int -> 'a) -> 'a *)
let arg a b c = c ( b + a ) ;;   (* arg : int -> int -> (int -> 'a) -> 'a *)
let z a = a ;;                   (* the identity, used to terminate the chain *)

then

sum (arg 1) ;; 

is well-typed as (int -> 'a) -> 'a = <fun>, i.e. an application of a function that converts int to 'a. Think of it as the application of Xa to the argument ( 0 + 1 ), where Xa is the function argument (per the type signature). Zero is supplied by the definition of 'sum'.

 sum (arg 1) (arg 2);; 

can be parsed as

(sum (arg 1)) (arg 2) ;; 

'(arg 2)' produces an application of an int and a yet-to-be-determined function to 'a,

i.e. it's typed as int -> (int -> 'a) -> 'a = <fun>. So it can serve as the Xa passed in above.

Or, Xa = Xb( ( 0 + 1 ) + 2)

where, again, Xb is a yet-to-be defined function that is supplied as an argument.

Therefore, you can collapse the whole chain with the identity function z. But, of course, it could be anything else -- square root perhaps for MSE?

All very clever.

{1573}
ref: -2022 tags: adipose tissue micro-RNA miRNA extracellular vesicles diabetes date: 09-12-2022 18:00 gmt revision:0 [head]

PMID-36070680 Extracellular vesicles mediate the communication of adipose tissue with brain and promote cognitive impairment associated with insulin resistance

  • Claim that adipose tissue communicates with the brain through blood-borne extracellular vesicles containing miRNA.
  • These EVs + miRNA impair cognitive function and promote synaptic loss.
  • File this under 'not sure if I believe it / but sure is interesting if true'!

{1571}
ref: -2022 tags: language learning symbolic regression Fleet meta search Piantadosi date: 09-05-2022 01:57 gmt revision:5 [4] [3] [2] [1] [0] [head]

One model for the learning of language

  • Yuan Yang and Steven T. Piantadosi
  • Idea: Given a restricted compositional 'mentalese' programming language / substrate, construct a set of grammatical rules ('hypotheses') from a small number of examples of an (abstract) language.
    • Pinker's argument that there is too little stimulus ("poverty of the stimulus") for children to discern grammatical rules, hence the rules must be innate, is thereby refuted.
      • This is not the only refutation.
      • An argument was made on Twitter that large language models also refute the poverty-of-the-stimulus hypothesis. Meh, this paper does it far better -- the data used to train transformers is hardly small.
  • Hypotheses are sampled from the substrate using MCMC, and selected based on a smoothed Bayesian likelihood.
    • This likelihood takes into account partial hits -- results that are within an edit distance of one of the desired sets of strings. (I think)
  • They use parallel tempering to search the space of programs (a generic sketch of the idea follows this list).
    • Roughly: keep alive many different hypotheses, and vary the temperatures of each lineage to avoid getting stuck in local minima.
    • But there are other search heuristics; see https://codedocs.xyz/piantado/Fleet/
  • Execution is on the CPU, across multiple cores / threads, possibly across multiple servers.
  • Larger hypotheses took up to 7 days to find (!)
    • These aren't that complicated grammars..
  • See earlier paper {1572}
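
A generic parallel-tempering sketch (my own toy on a made-up integer 'hypothesis' space, not Fleet's implementation): several MCMC chains run at different temperatures; hot chains escape local minima, cold chains refine, and adjacent chains occasionally swap states under the usual Metropolis criterion.

import math, random

def neg_log_posterior(h):          # toy multimodal score; minimum near h = 42
    return abs(h - 42) + 5 * math.sin(h)

def propose(h):                    # local move in "hypothesis" space
    return h + random.choice([-3, -2, -1, 1, 2, 3])

def parallel_tempering(temps=(0.2, 1.0, 5.0, 25.0), steps=5000):
    chains = [random.randint(0, 100) for _ in temps]
    best = min(chains, key=neg_log_posterior)
    for _ in range(steps):
        for i, T in enumerate(temps):              # within-chain Metropolis updates
            cand = propose(chains[i])
            dE = neg_log_posterior(cand) - neg_log_posterior(chains[i])
            if dE < 0 or random.random() < math.exp(-dE / T):
                chains[i] = cand
        i = random.randrange(len(temps) - 1)       # propose swapping adjacent temperatures
        dE = neg_log_posterior(chains[i]) - neg_log_posterior(chains[i + 1])
        if random.random() < math.exp(min(0.0, dE * (1 / temps[i] - 1 / temps[i + 1]))):
            chains[i], chains[i + 1] = chains[i + 1], chains[i]
        best = min(best, chains[0], key=neg_log_posterior)
    return best

print(parallel_tempering())        # should land at or near the global minimum (42)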

  • This is very similar to {842}, only on grammars rather than continuous signals from MoCap.
  • Proves once again that:
    1. Many domains of the world can be adequately described by relatively simple computational structures (It's a low-D, compositional world out there)
      1. Or, the Johnson-Lindenstrauss lemma
    2. You can find those hypotheses through brute-force + heuristic search. (At least to the point that you run into the curse of dimensionality)

A more interesting result is Deep symbolic regression for recurrent sequences, where the authors (facebook/meta) use a Transformer -- in this case, directly taken from Vaswani 2017 (8-head, 8-layer QKV w/ a latent dimension of 512) -- to do both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!

While the language learning paper shows that small generative programs can be inferred from a few samples, the Meta symbolic regression shows that Transformers can evince either amortized memory (less likely) or algorithms for perception -- both new and interesting. It suggests that 'even' abstract symbolic learning tasks are sufficiently decomposable that the sorts of algorithms available to an 8-layer transformer can give a useful search heuristic. (N.B. the transformer doesn't spit out perfect symbolic or numerical results directly -- it also needs post-processing search. Also, the transformer algorithm has search (in the form of softmax) baked into its architecture.)

This is not a light architecture: they trained the transformer for 250 epochs, where each epoch was 5M equations in batches of 512. Each epoch took 1 hour on 16 Volta GPUs w 32GB of memory. So, 4k GPU-hours x ~10 TFlops = 1.4e20 Flops. Compare this with grammar learning above: 7 days on 32 cores operating at ~3 Gops/sec each is ~6e16 ops. Much, much smaller compute.

All of this is to suggest a central theme of computer science: a continuum between search and memorization.

  • The language paper does fast search, but does not learn from the process (bootstrap), and maintains little state/memory.
  • The symbolic regression paper does moderate amounts of search, but continually learns from the process, and stores a great deal of heuristics for the problem domain.

Most interesting for a visual neuroscientist (not that I'm one per se, but bear with me) is where on these axes (search, heuristic, memory) visual perception is. Clearly there is a high degree of recurrence, and a high degree of plasticity / learning. But is there search or local optimization? Is this coupled to the recurrence via some form of energy-minimizing system? Is recurrence approximating E-M?

{1572}
ref: -2019 tags: Piantadosi cogntion combinators function logic date: 09-05-2022 01:57 gmt revision:0 [head]

  • The Computational Origin of Representation (2019)
  • from Piantadosi; talks a big game... reviews some seminal literature ...
    • But the argument reduces to the established idea that you can represent boolean logic and arbitrary algorithms with Church encoding through S and K (and some tortuous symbol manipulation..) (a tiny sketch follows this list)
    • It seems that Piantadosi was perhaps excited by discovering and understanding combinators?
      • It is indeed super neat (and I didn't wade so deep to really understand it), but the backtracking search procedure embodied in pyChuriso is scarcely close to anything happening in our brains (and such backtracking search is common in CS..)
      • It is overwhelmingly more likely that we approximate other Turing-complete computations, by (evolutionary) luck and education.
      • The last parts of the paper, describing a continuum between combinators, logic, calculus, tensor approximations, and neuroscience, are ... very hand-wavey, with no implementation.
        • If you allow me to be hypercritical, this paper is an excellent literature review, but has limited impact for ML practitioners.
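
A tiny illustration of the S/K point (mine, not pyChuriso): two combinators suffice to encode booleans and logic. FALSE is written as K(SKK) (= K I, extensionally equal to SK) so the example runs under Python's strict evaluation.

S = lambda x: lambda y: lambda z: x(z)(y(z))
K = lambda x: lambda y: x
I = S(K)(K)            # identity: S K K z = K z (K z) = z

TRUE  = K              # TRUE  a b -> a
FALSE = K(I)           # FALSE a b -> b
NOT   = lambda p: p(FALSE)(TRUE)
AND   = lambda p: lambda q: p(q)(FALSE)

decode = lambda b: b("true")("false")
print(decode(NOT(TRUE)), decode(AND(TRUE)(TRUE)))   # -> false true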

{1570}
ref: -0 tags: Balduzzi backprop biologically plausible red-tape date: 05-31-2022 20:48 gmt revision:1 [0] [head]

Kickback cuts Backprop's red-tape: Biologically plausible credit assignment in neural networks

Bit of a meh -- idea is, rather than propagating error signals backwards through a hierarchy, you propagate only one layer + use a signed global reward signal. This works by keeping the network ‘coherent’ -- positive neurons have positive input weights, and negative neurons have negative weights, such that the overall effect of a weight change does not change sign when propagated forward through the network.

This is kind of a lame shortcut, imho, as it limits the types of functions that the network can model & the computational structure of the network. This is already quite limited by the common dot-product-rectifier structure (as is used here). Much more interesting and possibly necessary (given much deeper architectures now) is to allow units to change sign. (Open question as to whether they actually frequently do!). As such, the model is in the vein of "how do we make backprop biologically plausible by removing features / communication" rather than "what sorts of signals and changes does the brain use to perceive and generate behavior".

This is also related to the literature on what ResNets do; what are the skip connections for? Anthropic has some interesting analyses for Transformer architectures, but checking the literature on other resnets is for another time.

{1569}
ref: -2022 tags: symbolic regression facebook AI transformer date: 05-17-2022 20:25 gmt revision:0 [head]

Deep symbolic regression for recurrent sequences

Surprisingly, they do not do any network structure changes; it's Vaswani 2017 w/ an 8-head, 8-layer transformer (sequence to sequence, not decoder only) with a latent dimension of 512. Significant work was in feature / representation engineering (e.g. base-10k representations of integers and fixed-precision representations of floating-point numbers -- both of which involve a vocabulary size of ~10k ... amazing still that this works..) + the significant training regimen they worked with (16 Turing GPUs, 32 GB ea). Note that they do perform a bit of beam-search over the symbolic regressions by checking how well each node fits to the starting sequence, but the models work even without this degree of refinement. (As always, there undoubtedly was significant effort spent in simply getting everything to work)
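
My guess at what a base-10k integer representation looks like (not the paper's tokenizer): each integer becomes a sign token plus base-10000 'digits', which keeps the vocabulary around 10k tokens.

def encode_base10k(n):
    sign = ["+"] if n >= 0 else ["-"]
    n, digits = abs(n), []
    while True:
        n, r = divmod(n, 10000)
        digits.append(str(r))
        if n == 0:
            break
    return sign + digits[::-1]

print(encode_base10k(-123456789))   # ['-', '1', '2345', '6789']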

The paper does both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!

Analysis of how the transformers work for these problems is weak; only one figure showing that the embeddings of the integers follow some meandering but continuous path in t-SNE space. Still, the trained transformer is usually able to best hand-coded sequence inference engine(s) in Mathematica, and does so without memorizing all of the training data. Very impressive and important result, enough to convince that this learned representation (and undiscovered cleverness, perhaps) beats human mathematical engineering, which probably took longer and more effort.

It follows, without too much imagination (but vastly more compute), that you can train an 'automatic programmer' in the very same way.

{1568}
ref: -2021 tags: burst bio plausible gradient learning credit assignment richards apical dendrites date: 05-05-2022 15:44 gmt revision:2 [1] [0] [head]

Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits

  • Roughly, single-events indicate the normal feature responses of neurons, while multiple-spike bursts indicate error signals.
  • Bursts are triggered by depolarizing currents to the apical dendrites, which can be uncoupled from bottom-up event rate, which arises from perisomatic inputs / basal dendrites.
  • The fact that the two are imperfectly multiplexed is OK, as in backprop the magnitude of the error signal is modulated by the activity of the feature detector.
  • "For credit assignment in hierarchical networks, connections should obey four constraints:
    • Feedback must steer the magnitude and sign of plasticity
    • Feedback signals from higher-order areas must be multiplexed with feedforward signals from lower-order areas so that credit assignment can percolate down the hierarchy with minimal effect on sensory information
    • There should be some form of alignment between feedforward and feedback connections
    • Integration of credit-carrying signals should be nearly linear to avoid saturation
      • Seems it's easy to saturate the burst probability within a window of background event rate, e.g. the window is all bursts to no bursts.
  • Perisomatic inputs were short-term depressing, whereas apical dendrite synapses were short-term facilitating.
    • This is a form of filtering on burst rates? E.g. they propagate better down than up?
  • They experiment with a series of models, one for solving the XOR task, and subsequent for MNIST and CIFAR.
  • The latter, larger models are mean-field models, rather than biophysical neuron models, and have a few extra features:
    • Interneurons, presumably SOM neurons, are used to keep bursting within a linear regime via a 'simple' (supplementary) learning rule.
    • Feedback alignment occurs by adjusting both the feedforward and feedback weights with the same propagated error signal + weight decay.
  • The credit assignment problem, or in the case of unsupervised learning, the coordination problem, is very real: how do you change a middle-feature to improve representations in higher (and lower) levels of the hierarchy?
    • They mention that using REINFORCE on the same network was unable to find a solution.
    • Put another way: usually you need to coordinate the weight changes in a network; changing weights individually based on a global error signal (or objective function) does not readily work...
      • Though evolution seems to be quite productive at getting the settings of (very) large sets of interdependent coefficients all to be 'correct' and (sometimes) beautiful.
      • How? Why? Friston's free energy principle? Lol.

{1567}
ref: -0 tags: evolution simplicity symmetry kolmogorov complexity polyominoes protein interactions date: 04-21-2022 18:22 gmt revision:5 [4] [3] [2] [1] [0] [head]

Symmetry and simplicity spontaneously emerge from the algorithmic nature of evolution

  • Central hypothesis is that simplicity and symmetry arise not through natural selection, but because these forms are overwhelmingly represented in the genotype-phenotype map
  • Experimental example here was "polyominoes", where there are N=16 tiles, each with 4 numbers (encoded with e.g. 6-bit binary numbers). The edge numbers determine how the tiles irreversibly bind, e.g. 1 <-> 2, 3 <-> 4 etc, with 4 and 2^6-1 binding to nothing.
  • These tiles are allowed to 'randomly' self-assemble. Some don't terminate (e.g. they form continuous polymers); these are discarded; others do terminate (no more available binding sites).
  • They assessed the complexity of polyominoes both selected for a particular size, e.g. 16 tiles, and those not selected at all, other than terminating.
  • In both cases, complexity was assessed based on how many actual interactions were needed to make the observed structure. That is, they removed tile edge numbers and kept an interaction if it affected the n-mer formation.
  • Result was this nice log-log plot:
  • Showed that this same trend holds for protein-protein complexes (weaker result, imho)
  • As well as RNA secondary structure
  • And metabolic time-series in a ODE modeled on yeast metabolism (even weaker result..)

The paper features an excellent set of references.

Letter to a friend following her article Machine learning in evolutionary studies comes of age

Read your PNAS article last night, super interesting that you can get statistical purchase on long-lost evolutionary 'sweeps' via GANs and other neural network models.  I feel like there is some sort of statistical power issue there?  DNNs are almost always over-parameterized... slightly suspicious.

This morning I was sleepily mulling things over & thought about a walking conversation that we had a long time ago in the woods of NC:  Why is evolution so effective?  Why does it seem to evolve to evolve?  Thinking more -- and having years more perspective -- it seems almost obvious in retrospect: it's a consequence of Bayes' rule.  Evolution finds solutions in spaces that have overwhelming prevalence of working solutions.  The prior has an extremely strong effect.  These representational / structural spaces by definition have many nearby & associated solutions, hence appear post-hoc 'evolvable'.  (You probably already know this.)

I think proteins very much fall into this category: AA were added to the translation machinery based on ones that happened to solve a particular problem... but because of the 'generalization prior' (to use NN parlance), they were useful for many other things.  This does not explain the human-engineering-like modularity of mature evolved systems, but maybe that is due to the strong simplicity prior [1]

Very very interesting to me is how the science of evolution and neural networks are drawing together, vis a vis the lottery ticket hypothesis. Both evince a continuum of representational spaces, too, from high-dimensional vectoral (how all modern deep learning systems work) to low-dimensional modular, specific, and general (phenomenological human cognition). I suspect that evolution uses a form of this continuum, as seen in the human high-dimensional long-range gene regulatory / enhancer network (= a structure designed to evolve). Not sure how selection works here, though; it's hard to search a high-dimensional space. The brain has an almost identical problem: it's hard to do 'credit assignment' in a billions-large, deep and recurrent network. Finding which set of synapses caused a good / bad behavior takes a lot of bits.

{1566}
ref: -1992 tags: evolution baldwin effect ackley artificial life date: 03-21-2022 23:20 gmt revision:0 [head]

Interactions between learning and evolution

  • Ran simulated evolution and learning on a population of agents over ~100k lifetimes.
  • Each agent can last several hundred timesteps within a gridworld-like environment.
  • Said gridworld environment has plants (food), trees (shelter), carnivores, and other agents (for mating)
  • Agent behavior is parameterized by an action network and an evaluation network.
    • The action network transforms sensory input into actions
    • The evaluation network sets the valence (positive or negative) of the sensory signals
      • This evaluation network modifies the weights of the action network using a gradient-based RL algorithm called CRBP (complementary reinforcement back-propagation), which reinforces based on the temporal derivative, and reinforces the complement when an action does not increase reward, with some e-greedy exploration.
        • It's not perfect, but as they astutely say, any reinforcement learning algorithm involves some search, so generally heuristics are required to select new actions in the face of uncertainty.
      • Observe that it seems easier to make a good evaluation network than action network (evaluation network is lower dimensional -- one output!)
    • Networks are implemented as one-layer perceptrons (boring, but they had limited computational resources back then)
  • Showed (roughly) that in winner populations you get:
    • When learning is an option, the population will learn, and with time this will grow to anticipation / avoidance
    • This will transition to the Baldwin effect; learned behavior becomes instinctive
      • But, interestingly, only when the problem is incompletely solved!
      • If it's completely solved by learning (eg super fast), then there is no selective leverage on innate behavior over many generations.
      • Likewise, the survival problem to be solved needs to be stationary and consistent for long enough for the Baldwin effect to occur.
    • Avoidance is a form of shielding, and learning no longer matters on this behavior
    • Even longer term, shielding leads to goal regression: avoidance instincts allow the evaluation network to do something else, set new goals.
      • In their study this included goals such as approaching predators (!).

Altogether (historically) interesting, but some of these ideas might well have been anticipated by some simple hand calculations.

{1565}
ref: -0 tags: nvidia gpuburn date: 02-03-2022 20:27 gmt revision:6 [5] [4] [3] [2] [1] [0] [head]

Compiling a list of saturated matrix-matrix gflops for various Nvidia GPUs. (A rough way to reproduce such numbers is sketched after the list.)

  1. GTX 1050 Mobile in a Lenovo Yoga 720
    1. ?? W
    2. 1640 Gflops/sec (float)
    3. 65 Gflops/sec (double)
    4. 2 GB ram, 640 cores, ?? clock / (64C)
  2. T2000 in a Lenovo P1 Gen 3
    1. 34 W
    2. 2259 GFlops/sec (float)
    3. 4 Gb ram, 1024 cores, clock 1185 / 7000 MHz
  3. GTX 1650 Max-Q in a Lenovo X1 extreme Gen 3
    1. 35 W
    2. 2580 GFlops/sec (float)
    3. 116 GFlops/sec (double)
    4. 4 Gb ram, 1024 cores, clock 1335 (float) 1860 (double) / 10000 MHz (56C)
  4. RTX 3080 in a MSI Creator 17
    1. 80 W
    2. 5400 GFlops/sec (float)
    3. 284 GFlops/sec (double)
    4. 16 Gb ram, 6144 cores, clock 855 (float) 1755 (double) / 12000 MHz (68C)
      1. Notable power / thermal throttling on this laptop.
  5. EVGA RTX 2080Ti
    1. 260 W
    2. 11800 GFlops / sec (float)
    3. 469 GFlops / sec (double)
    4. 11 Gb ram, 4352 cores, clock 1620 (float) 1905 (double) / 13600 MHz (74C)
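
Not how the table above was measured (the gpuburn tag suggests gpu_burn was), but a rough PyTorch cross-check of saturated matmul throughput, float and double:

import time, torch

def matmul_gflops(n=4096, iters=50, dtype=torch.float32):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return 2 * n**3 * iters / (time.time() - t0) / 1e9   # ~2*n^3 FLOPs per n x n matmul

print("float: ", round(matmul_gflops()), "GFlops/sec")
print("double:", round(matmul_gflops(n=2048, iters=20, dtype=torch.float64)), "GFlops/sec")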

{1564}
ref: -2008 tags: t-SNE dimensionality reduction embedding Hinton date: 01-25-2022 20:39 gmt revision:2 [1] [0] [head]

“Visualizing data using t-SNE”

  • Laurens van der Maaten, Geoffrey Hinton.
  • SNE: stochastic neighbor embedding, Hinton 2002.
  • Idea: model the data conditional pairwise distribution as a gaussian, with one variance per data point, p(x_i | x_j)
  • in the mapped data, this pairwise distribution is modeled as a fixed-variance gaussian, too, q(y_i | y_j)
  • Goal is to minimize the Kullback-Leibler divergence Σ_i KL(p_i || q_i) (summed over all data points)
  • Per-data-point variance is found via binary search to match a user-specified perplexity. This amounts to setting a number of nearest neighbors; values somewhere between 5 and 50 work OK. (The usage sketch after this list shows where this knob lives in practice.)
  • Cost function is minimized via gradient descent, starting with a random distribution of points y_i, with plenty of momentum to speed up convergence, and noise to effect simulated annealing.
  • Cost function is remarkably simple to reduce; gradient update: δC/δy_i = 2 Σ_j ( p_{j|i} − q_{j|i} + p_{i|j} − q_{i|j} ) ( y_i − y_j )
  • t-SNE differs from SNE (above) in that it addresses difficulty in optimizing the cost function, and crowding.
    • Uses a simplified symmetric cost function (symmetric conditional probability, rather than joint probability) with simpler gradients
    • Uses the student’s t-distribution in the low-dimensional map q to reduce crowding problem.
  • The crowding problem is roughly resultant from the fact that, in high-dimensional spaces, the volume of the local neighborhood scales as r^m, whereas in 2D, it's just r^2. Hence there is cost-incentive to pushing all the points together in the map -- points are volumetrically closer together in high dimensions than they can be in 2D.
    • This can be alleviated by using a one-DOF student distribution, which is the same as a Cauchy distribution, to model the probabilities in map space.
  • Smart -- they plot the topology of the gradients to gain insight into modeling / convergence behavior.
  • Don’t need simulated annealing due to balanced attractive and repulsive effects (see figure).
  • Enhance the algorithm further by keeping it compact at the beginning, so that clusters can move through each other.
  • Look up: d-bits parity task by Bengio 2007
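
For reference, the scikit-learn interface exposes the perplexity knob discussed above (standard library usage, nothing specific to the paper):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(500, 50)                         # placeholder data: 500 points in 50-D
Y = TSNE(n_components=2, perplexity=30.0).fit_transform(X)
print(Y.shape)                                       # (500, 2)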

{1563}
ref: -0 tags: date: 01-09-2022 19:04 gmt revision:1 [0] [head]

The Sony Xperia XZ1 compact is a better phone than an Apple iPhone 12 mini

I don't normally write any personal opinions here -- just half-finished paper notes riddled with typos (haha) -- but this one has been bothering me for a while.

November 2020 I purchased an iPhone 12 mini to replace my aging Sony Xperia XZ1 compact. (Thinking of staying with Android, I tried out a Samsung S10e as well, but didn't like it.) Having owned and used the iPhone for a year and change, I still prefer the Sony. Here is why:

  • Touch screen
    • The iPhone is MUCH more sensitive to sweat than the Sony
    • This is the biggest problem, since I like to move (hike, bike, kayak etc), it lives in my pocket, and inevitably gets a bit of condensation or water on it.
    • The iPhone screen is rendered frustrating to use with even an imperceptible bit of moisture on it.
      • Do iPhone users not sweat?
      • Frequently I can't even select the camera app! Or switch to maps!
        • A halfway fix is to turn the screen off then on again. Halfway.
    • The Sony, in comparison, is relatively robust, and works even to the point where there were droplets of water on it.
  • Size
    • They are both about the same size with a case, Sony is 129 x 65 x 9.3 mm ; iPhone mini is 131.5 x 64.2 x 7.4mm.
    • This size is absolutely perfect and manufacturers need to make more phones with these dimensions!
    • If anything, the iPhone is better here -- the rounded corners are nice.
  • Battery
    • Hands down, the Sony. Lasts >=2x as long as the iPhone.
  • Processor
    • Both are fast enough.
  • Software
    • Apple is not an ecosystem. No. It's a walled garden where a select few plants may grow. You do what Apple wants you to do.
      • E.g. want to use any Google apps on iPhone? No problem! Want to use any Apple apps on Android or web or PC? Nope, sorry, you have to buy a $$$ MacBook pro.
    • Ok, the privacy on an iPhone is nice. Modulo that bit where they scanned our photos.
      • As well as the ability to manage notifications & basically turn them all off :)
    • There are many more apps on Android, and they are much less restricted in what they can do.
      • For example, recently we were in the desert & wanted a map of where the cell signal was strong, for remote-working. This is easy on Android (there is an app for it).
        • This is impossible on iPhone (the apps don't have access to the information).
      • Second example, you can ssh into an Android and use that to download large files (e.g. packages, datasets) to avoid using limited tethering data.
        • This is also impossible on iPhone.
    • Why does iMessage make all texts from Android users yucky green? Why is there zero option to change this?
    • Why does iMessage send very low resolution photos to my friends and family using Android? It sends beautiful full-res photos to other Apple phones.
    • Why is there no web interface to iMessage?
      • Ugh, this iPhone is such an elitist snob.
    • You can double-tap on the square in Android to quickly switch between apps, which is great.
    • Apple noticeably auto-corrects to a smaller vocabulary than desired. Android is less invasive in this respect.
  • Cell signal
    • They are similarly unreliable, though the iPhone has 5G & many more wireless bands, which is great.
    • Still, frequently I'll have one-two bars of connectivity & yet Google Maps will say "you are offline". This is much less frequent on the Sony.
  • Screen
    • iPhone screen is better.
  • Camera
    • iPhone camera is very very much better.
  • Speaker
    • iPhone speaker much better. But it sure burns the battery.
  • Wifi
    • iPhone will periodically disconnect from Wifi when on Facetime calls. Sony doesn't do this.
      • Facetime only works with Apple devices.
  • Price
    • Sony wins
  • Unlock
    • Face unlock is a cool idea, but we all wear masks now.
    • The Sony has a fingerprint sensor, which is better.
      • In the case where I'm moving (and possibly sweaty), Android is smart enough to allow quick unlock, for access to the camera app or maps. Great feature.

Summary: I'll try to get my money's worth out of the iPhone; when it dies, will buy the smallest waterproof Android phone that supports my carrier's bands.

{1561}
ref: -0 tags: date: 01-09-2022 19:03 gmt revision:1 [0] [head]

Cortical response selectivity derives from strength in numbers of synapses

  • Benjamin Scholl, Connon I. Thomas, Melissa A. Ryan, Naomi Kamasawa & David Fitzpatrick
  • "Using electron microscopy reconstruction of individual synapses as a metric of strength, we find no evidence that strong synapses have a predominant role in the selectivity of cortical neuron responses to visual stimuli. Instead, selectivity appears to arise from the total number of synapses activated by different stimuli."
  • "Our results challenge the role of Hebbian mechanisms in shaping neuronal selectivity in cortical circuits, and suggest that selectivity reflects the co-activation of large populations of presynaptic neurons with similar properties and a mixture of strengths. "
    • Interesting -- so this is consistent with ANNs / feature detectors / vector hypothesis.
    • It would imply that the mapping is dense rather than sparse -- but to see this, you'd need to record the activity of all these synapses in realtime.
      • Which is possible, (e.g. lightbeads, fast axial focusing), just rather difficult for now.
  • To draw really firm conclusions, would need a thorough stimulus battery, not just drifting gratings.
    • It may change this result: "Surprisingly, the strength of individual synapses was uncorrelated with functional similarity to the somatic output (that is, absolute orientation preference difference)"

{842}
ref: work-0 tags: distilling free-form natural laws from experimental data Schmidt Cornell automatic programming genetic algorithms date: 12-30-2021 05:11 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

Distilling free-form natural laws from experimental data

  • The critical step was to use the full set of all pairs of partial derivatives (δx/δy) to evaluate the search for invariants. (A toy sketch of this scoring idea follows the list.)
  • The selection of which partial derivatives are held to be independent / which variables are dependent is a bit of a trick too -- see the supplemental information.
    • Even yet, with a 4D data set the search for natural laws took ~ 30 hours.
  • This was via a genetic algorithm, distributed among 'islands' on different CPUs, with mutation and single-point crossover.
  • Not sure what the IL is, but it appears to be floating-point assembly.
  • Timeseries data is smoothed with Loess smoothing, which fits a polynomial to the data, and hence allows for smoother / more analytic derivative calculation.
    • Then again, how long did it take humans to figure out these invariants? (Went about it in a decidedly different way..)
    • Further, how long did it take for biology to discover similar 'design equations'?
      • The same algorithm has been applied to biological data - a metabolic pathway - with some success pub 2011.
      • Of course evolution had to explore a much larger space - proteins and regulatory pathways, not simpler mathematical expressions / linkages.
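
A toy rendering of that scoring idea (my reading of the paper, not its code): compare the pairwise derivative implied by a candidate invariant against the same ratio estimated numerically from the data.

import numpy as np
import sympy as sp

t = np.linspace(0.05, 10, 2000)
x, v = np.sin(t), np.cos(t)              # toy data: harmonic oscillator trajectory

xs, vs = sp.symbols("x v")
f = xs**2 + vs**2                        # candidate invariant (here, the true one)
dxdv_pred = sp.lambdify((xs, vs), -sp.diff(f, vs) / sp.diff(f, xs))   # implied dx/dv

dxdv_data = np.gradient(x, t) / np.gradient(v, t)    # numerical dx/dv along the trajectory
ratio = dxdv_pred(x, v) / dxdv_data
err = np.nanmedian(np.abs(np.log(np.abs(ratio))))
print("median log-error:", err)          # small => the candidate matches the data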


Since his PhD, Michael Schmidt has gone on to found Nutonian, which produced the Eureqa software, apparently without dramatic new features other than being able to use the cloud for equation search. (Probably he improved many other detailed facets of the software..). Nutonian received $4M in seed funding, according to Crunchbase.

In 2017, Nutonian was acquired by Data Robot (for an undisclosed amount), where Michael has worked since, rising to the title of CTO.

Always interesting to follow up on the authors of these classic papers!