One model for the learning of language
 Yuan Yang and Steven T. Piantadosi
 Idea: Given a restricted compositional 'mentalese' programming language / substrate, construct a set of grammatical rules ('hypotheses') from a small number of examples of an (abstract) language.
 Pinker's argument that there is too little stimulus ("poverty of the stimulus") for children to discern grammatical rules, and hence the rules must be innate, is thereby refuted.
 This is not the only refutation.
 An argument was made on Twitter that large language models also refute the poverty-of-the-stimulus argument. Meh; this paper does it far better, as the data used to train transformers is hardly small.
 Hypotheses are sampled from the substrate using MCMC, and selected based on a smoothed Bayesian likelihood.
 This likelihood gives partial credit to near-misses: generated strings within a small edit distance of one of the desired strings still contribute. (I think)
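A minimal sketch of such a smoothed likelihood (my reading of the idea, not their actual code), assuming a geometric noise model where each edit costs a fixed log-probability penalty:

```python
# Sketch of an edit-distance-smoothed likelihood: a hypothesis whose output
# is *close* to the observed strings still gets partial credit, with the
# contribution decaying geometrically in the edit distance. The noise model
# and p_noise value are illustrative assumptions, not the paper's.
from math import log

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def log_likelihood(observed: list[str], generated: set[str],
                   p_noise: float = 0.01) -> float:
    """Score each observed string against the closest generated string:
    an exact hit costs nothing, each edit costs log(p_noise)."""
    total = 0.0
    for s in observed:
        d = min(edit_distance(s, g) for g in generated)
        total += d * log(p_noise)
    return total

hypothesis_output = {"ab", "aabb", "aaabbb"}   # e.g. an a^n b^n hypothesis
data = ["ab", "aabb", "aab"]                   # the last one is a near-miss
print(log_likelihood(data, hypothesis_output))
```

Exact hits contribute zero log-penalty, so a hypothesis generating the data perfectly dominates, but near-miss hypotheses remain comparable rather than getting zero probability.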
 They use parallel tempering to search the space of programs.
 Roughly: keep alive many different hypotheses, and vary the temperatures of each lineage to avoid getting stuck in local minima.
 But there are other search heuristics; see https://codedocs.xyz/piantado/Fleet/
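The parallel-tempering idea can be sketched generically (this is not Fleet's implementation; the toy continuous score function stands in for a grammar's log-posterior):

```python
# Generic parallel tempering: several MCMC chains run at different
# temperatures. Hot chains explore broadly, cold chains exploit locally,
# and adjacent chains occasionally swap states so a good state found by a
# hot chain can migrate down and rescue a cold chain stuck in a local optimum.
import math
import random

random.seed(0)

def score(x: float) -> float:
    """Toy multimodal log-posterior standing in for a grammar's posterior."""
    return math.sin(5 * x) - x * x

def parallel_tempering(temps=(1.0, 2.0, 4.0, 8.0), steps=5000):
    states = [random.uniform(-2, 2) for _ in temps]
    best = states[0]
    for _ in range(steps):
        # One Metropolis step per chain, at that chain's temperature.
        for k, T in enumerate(temps):
            prop = states[k] + random.gauss(0, 0.3)
            delta = (score(prop) - score(states[k])) / T
            if random.random() < math.exp(min(0.0, delta)):
                states[k] = prop
        # Propose swapping a random adjacent pair of chains.
        k = random.randrange(len(temps) - 1)
        dscore = score(states[k + 1]) - score(states[k])
        log_acc = dscore * (1 / temps[k] - 1 / temps[k + 1])
        if random.random() < math.exp(min(0.0, log_acc)):
            states[k], states[k + 1] = states[k + 1], states[k]
        # Track the best state visited by the coldest chain.
        if score(states[0]) > score(best):
            best = states[0]
    return best

best = parallel_tempering()
print(f"best x = {best:.3f}, score = {score(best):.3f}")
```

The swap acceptance ratio keeps the joint distribution over all chains invariant, so each chain still samples its own tempered posterior.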
 Execution is on the CPU, across multiple cores / threads, possibly across multiple servers.
 Larger hypotheses took up to 7 days to find (!)
 These aren't especially complicated grammars, either..
 See earlier paper {1572}
 This is very similar to {842}, only on grammars rather than continuous signals from MoCap.
 Proves once again that:
 Many domains of the world can be adequately described by relatively simple computational structures (It's a low-D, compositional world out there)
 Or, the Johnson-Lindenstrauss lemma
 You can find those hypotheses through brute-force + heuristic search. (At least until you run into the curse of dimensionality)
A more interesting result is Deep symbolic regression for recurrent sequences, where the authors (Facebook/Meta) use a Transformer (in this case taken directly from Vaswani 2017: 8-head, 8-layer QKV, with a latent dimension of 512) to do both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!
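A sketch of what that training setup looks like, as I understand it (format, names, and the hand-written recurrences are my own, not the paper's; they sample random expression trees at much larger scale):

```python
# Training-data setup for symbolic regression on recurrent sequences:
# sample a recurrence u_n = f(u_{n-1}, u_{n-2}, n), roll it out, and form
# (sequence prefix -> target) pairs. The symbolic task predicts the
# recurrence expression; the numeric task predicts the continuation.
import random

random.seed(1)

# Hand-written recurrences standing in for randomly generated expression trees.
RECURRENCES = {
    "u[n-1] + u[n-2]": lambda u, n: u[n - 1] + u[n - 2],
    "2*u[n-1] - n":    lambda u, n: 2 * u[n - 1] - n,
    "u[n-1] + n*n":    lambda u, n: u[n - 1] + n * n,
}

def rollout(expr: str, u0: int, u1: int, length: int) -> list[int]:
    """Unroll a recurrence from two initial terms."""
    f = RECURRENCES[expr]
    u = [u0, u1]
    for n in range(2, length):
        u.append(f(u, n))
    return u

def make_example(prefix_len=8, horizon=3):
    expr = random.choice(list(RECURRENCES))
    seq = rollout(expr, random.randint(0, 3), random.randint(0, 3),
                  prefix_len + horizon)
    return {
        "input": seq[:prefix_len],           # what the transformer sees
        "symbolic_target": expr,             # symbolic-regression target
        "numeric_target": seq[prefix_len:],  # numeric-regression target
    }

print(make_example())
```

Because the generator is known, supervision is free and unlimited, which is what makes the 250-epoch x 5M-equation training regime below possible.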
While the language learning paper shows that small generative programs can be inferred from a few samples, the Meta symbolic regression shows that Transformers can evince either amortized memory (less likely) or algorithms for perception, both new and interesting. It suggests that 'even' abstract symbolic learning tasks are sufficiently decomposable that the sorts of algorithms available to an 8-layer transformer can give a useful search heuristic. (N.B. the transformer doesn't spit out perfect symbolic or numerical results directly; it also needs post-processing search. Also, the transformer has search, in the form of the softmax, baked into its architecture.)
This is not a light architecture: they trained the transformer for 250 epochs, where each epoch was 5M equations in batches of 512. Each epoch took 1 hour on 16 Volta GPUs w/ 32GB of memory. So, 4k GPU-hours x ~10 TFLOPs = 1.4e20 FLOPs. Compare this with grammar learning above: 7 days x 32 cores operating at ~3 Gops/sec is ~6e16 ops. Much, much smaller compute.
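Spelling out the back-of-envelope arithmetic (assuming full utilization at the stated rates, which is generous on both sides):

```python
# Rough compute comparison: transformer training vs. grammar search.
gpu_hours = 250 * 16                  # 250 epochs x 1 h/epoch x 16 V100s
gpu_flops = gpu_hours * 3600 * 10e12  # at ~10 TFLOPs sustained per GPU
cpu_ops = 7 * 86400 * 32 * 3e9        # 7 days x 32 cores x ~3 Gops/s

print(f"transformer:    {gpu_flops:.1e} FLOPs")
print(f"grammar search: {cpu_ops:.1e} ops")
print(f"ratio:          {gpu_flops / cpu_ops:.0f}x")
```

Even with these crude assumptions, the transformer pipeline uses over three orders of magnitude more compute.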
All of this is to suggest a central theme of computer science: a continuum between search and memorization.
 The language paper does fast search, but does not learn from the process (no bootstrapping), and maintains little state/memory.
 The symbolic regression paper does moderate amounts of search, but continually learns from the process, and stores a great deal of heuristics for the problem domain.
Most interesting for a visual neuroscientist (not that I'm one per se, but bear with me) is where visual perception sits on these axes (search, heuristic, memory). Clearly there is a high degree of recurrence, and a high degree of plasticity / learning. But is there search or local optimization? Is this coupled to the recurrence via some form of energy-minimizing system? Is recurrence approximating EM?
 The Computational Origin of Representation (2019)
 from Piantadosi; talks a big game... reviews some seminal literature...
 But the argument reduces to the established idea that you can represent boolean logic and arbitrary algorithms with Church encoding through S and K (and some tortuous symbol manipulation..)
 It seems that Piantadosi was perhaps excited by discovering and understanding combinators?
 It is indeed super neat (and I didn't wade deep enough to really understand it), but the backtracking search procedure embodied in pyChuriso is scarcely close to anything happening in our brains (and such backtracking search is common in CS..)
 It is overwhelmingly more likely that we approximate other Turing-complete computations, by (evolutionary) luck and education.
 The last parts of the paper, describing a continuum between combinators, logic, calculus, tensor approximations, and neuroscience, are ... very hand-wavy, with no implementation.
 If you allow me to be hypercritical: this paper is an excellent literature review, but of limited impact for ML practitioners.
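For concreteness, the S/K encoding the paper builds on can be run directly. This is standard combinatory logic (nothing specific to pyChuriso's search), with one eager-evaluation concession: FALSE is written K(I) rather than S(K), since Python evaluates arguments strictly.

```python
# Boolean logic from the S and K combinators alone, as curried lambdas.
S = lambda x: lambda y: lambda z: x(z)(y(z))
K = lambda x: lambda y: x

I = S(K)(K)      # identity: S K K a = K a (K a) = a
TRUE = K         # TRUE a b = a
FALSE = K(I)     # FALSE a b = I b = b
NOT = lambda p: p(FALSE)(TRUE)
AND = lambda p: lambda q: p(q)(FALSE)

# An encoded boolean *is* its own if-then-else: apply it to two branches.
print(TRUE("then")("else"))
print(NOT(TRUE)("then")("else"))
print(AND(TRUE)(NOT(FALSE))("then")("else"))
```

That S and K suffice for all of this (and, with more plumbing, arbitrary computation) is the established result the paper leans on; the open question is whether that tells us anything about brains.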
