m8ta
{1453}
ref: -2019 tags: lillicrap google brain backpropagation through time temporal credit assignment date: 03-14-2019 20:24 gmt

PMID-22325196 Backpropagation through time and the brain

  • Timothy Lillicrap and Adam Santoro
  • Backpropagation through time: the 'canonical' extension of backprop for assigning credit in the recurrent neural networks used in machine learning.
    • E.g. variable roll-outs, where the error is propagated many times through the recurrent weight matrix, W^T.
    • This leads to the exploding or vanishing gradient problem.
  • TCA = temporal credit assignment. What led to this reward or error? How should memory be modified to encourage or avoid it?
  • One approach is to simply truncate the error propagation: truncated backpropagation through time (TBPTT). But this of course limits the horizon of learning.
  • The brain may do BPTT via replay in both the hippocampus and cortex (Nat. Neuroscience 2007), thereby alleviating the need to retain long time histories of neuron activations (needed for derivatives and credit assignment).
  • A less-known method of TCA uses RTRL (real-time recurrent learning), a form of forward-mode differentiation: ∂h_t/∂θ is computed and maintained online, often with synaptic weight updates applied at each time step in which there is non-zero error. See A learning algorithm for continually running fully recurrent neural networks.
    • Big problem: a network with N recurrent units requires O(N^3) storage and O(N^4) computation at each time-step (see the sketch below).
    • Can be mitigated with Unbiased Online Recurrent Optimization (UORO), which stores approximate but unbiased gradient estimates to reduce computation and storage.
  • Attention seems like a much better way of approaching the TCA problem: past events are stored externally, and the network learns a differentiable attention-alignment module for selecting these events.
    • Memory can be finite in size, expanding, or self-compressing.
    • Highlight the utility/necessity of content-addressable memory.
    • Attentional gating can eliminate the exploding / vanishing / corrupting gradient problems -- the gradient paths are skip-connections.
  • Biologically plausible: partial reactivation of CA3 memories induces re-activation of neocortical neurons responsible for initial encoding PMID-15685217 The organization of recent and remote memories. 2005
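
A minimal numpy sketch of RTRL for a vanilla RNN, h_t = tanh(W h_{t-1} + U x_t), to make the complexity above concrete: the sensitivity tensor S is the O(N^3) storage, and the einsum that propagates it is the O(N^4) per-step cost. Dimensions, loss, and data are illustrative stand-ins, not from the paper.

    import numpy as np

    N, M = 8, 4                          # hidden units, input size (illustrative)
    rng = np.random.default_rng(0)
    W = rng.normal(0, 1 / np.sqrt(N), (N, N))
    U = rng.normal(0, 1 / np.sqrt(M), (N, M))
    h = np.zeros(N)
    S = np.zeros((N, N, N))              # S[i, j, k] = dh_i / dW_jk -- O(N^3) storage
    lr = 1e-2

    for t in range(100):
        x = rng.normal(size=M)
        target = np.zeros(N)             # stand-in target, defining a per-step error
        h_new = np.tanh(W @ h + U @ x)
        D = 1.0 - h_new ** 2             # tanh'(a), the diagonal Jacobian
        # immediate term: da_i / dW_jk = delta_ij * h_k (h is the previous state)
        imm = np.zeros((N, N, N))
        imm[np.arange(N), np.arange(N), :] = h
        # propagate the old sensitivities through W -- the O(N^4) step
        S = D[:, None, None] * (np.einsum('il,ljk->ijk', W, S) + imm)
        h = h_new
        # apply a weight update at every step with non-zero error, as in RTRL
        dL_dh = h - target               # gradient of 0.5 * ||h - target||^2
        W -= lr * np.einsum('i,ijk->jk', dL_dh, S)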

  • I remain reserved about the utility of thinking in terms of gradients when describing how the brain learns. Correlations, yes; causation, absolutely; credit assignment, for sure. Yet propagating gradients as a means of changing network weights seems at best a part of the puzzle. So much of behavior and internal cognitive life involves explicit, conscious computation of cause and credit.
  • This leaves me much more sanguine about the use of external memory to guide behavior ... but differentiable attention? Hmm.

{1440}
ref: -2017 tags: attention transformer language model youtube google tech talk date: 02-26-2019 20:28 gmt

Attention is all you need

  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
  • 'Attention is all you need' neural network models
  • Good summary, along with: The Illustrated Transformer (please refer to this!)
  • Łukasz Kaiser mentions a few times how fragile the network is -- how easy it is to make something that doesn't train at all, and how many tricks from Google experts were needed to make things work properly. It might be bravado or bluffing, but this is arguably not the way that biology fails.
  • Encoding:
  • Input is words encoded as 512-length vectors.
  • Each 512-length vector is transformed into three length-64 vectors, a query, a key, and a value, via differentiable weight matrices.
  • Attention is computed as the dot-product of the query (current input word) with the keys (from the other words).
    • This score is scaled by 1/√d_k and passed through a softmax function, yielding attention weights that scale the values.
  • The heads' outputs are concatenated together, and this output is passed through a final weight matrix to produce a final value for the next layer (see the sketch after this list).
    • So, attention in this respect looks like a conditional gain field.
  • The 'final value' above is then passed through a single-layer feedforward net, with a resnet-style skip connection.
  • Decoding:
  • Use the attentional key value from the encoder to determine the first word through the output encoding (?) Not clear.
  • Subsequent causal decodes depend on the already 'spoken' words, plus the key-values from the encoder.
  • Output is a one-hot softmax layer fed from a feedforward layer; the whole model is differentiable from input to output using a cross-entropy loss or KL divergence.
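
A minimal numpy sketch of the scaled dot-product attention and multi-head concatenation described above, using the dimensions from the talk (512-length word vectors, length-64 queries/keys/values, 8 heads). The random matrices stand in for the learned projections.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # dot-product of queries with keys, scaled by sqrt(d_k), softmaxed,
        # then used to weight the values
        d_k = Q.shape[-1]
        return softmax(Q @ K.T / np.sqrt(d_k)) @ V

    rng = np.random.default_rng(0)
    n_words, d_model, d_k, n_heads = 5, 512, 64, 8
    X = rng.normal(size=(n_words, d_model))       # one 512-length vector per word

    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(0, 0.02, (d_model, d_k)) for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))

    # concatenate the heads (8 * 64 = 512) and mix with a final weight matrix;
    # the result feeds the single-layer feedforward net
    Wo = rng.normal(0, 0.02, (n_heads * d_k, d_model))
    out = np.concatenate(heads, axis=-1) @ Wo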

{1174}
ref: -0 tags: Hinton google tech talk dropout deep neural networks Boltzmann date: 02-12-2019 08:03 gmt

Brains, sex, and machine learning -- Hinton google tech talk.

  • Hinton believes in the power of crowds -- he thinks that the brain fits many, many different models to the data, then selects among them afterward.
    • Random forests, as used in Predator, are an example of this: they average many simple-to-fit, simple-to-run decision trees. (This is apparently what the Kinect does.)
  • Talk focuses on dropout, a clever new form of model averaging where only half of the units in the hidden layers are trained for a given example (see the sketch at the end of this entry).
    • He is inspired by biological evolution, where sexual reproduction often spontaneously adds or removes genes, hence individual genes or small linked genes must be self-sufficient. This equates to a 'rugged individualism' of units.
    • Likewise, dropout forces neurons to be robust to the loss of co-workers.
    • This is also great for parallelization: each unit or sub-network can be trained independently, on its own core, with little need for communication! Later, the units can be combined via genetic algorithms and then re-trained.
  • Hinton then observes that sending a real value p (output of logistic function) with probability 0.5 is the same as sending 0.5 with probability p. Hence, it makes sense to try pure binary neurons, like biological neurons in the brain.
    • Indeed, if you replace backpropagation with single-bit propagation, the resulting neural network trains more slowly and needs to be bigger, but it generalizes better.
    • Neurons (allegedly) do something very similar to this via Poisson spiking. Hinton claims this is the right thing to do (rather than sending real numbers via precise spike timing) if you want to robustly fit models to data.
      • Sending stochastic spikes is a very good way to average over the large number of models fit to incoming data.
      • Yes but this really explains little in neuroscience...
  • Paper referred to in intro: Livnat, Papadimitriou and Feldman, PMID-19073912 and later by the same authors PMID-20080594
    • A mixability theory for the role of sex in evolution. -- "We define a measure that represents the ability of alleles to perform well across different combinations and, using numerical iterations within a classical population-genetic framework, show that selection in the presence of sex favors this ability in a highly robust manner"
    • Plus David MacKay's concise illustration of why you need sex, pg 269, __Information theory, inference, and learning algorithms__
      • With rather simple assumptions, asexual reproduction yields 1 bit per generation,
      • whereas sexual reproduction yields √G bits per generation, where G is the genome size.
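
A minimal sketch of dropout on one logistic hidden layer, assuming the 50% rate described above ('inverted' dropout, so the train-time expectation matches the test-time pass that uses all units), plus Hinton's stochastic-binary observation. Shapes and rates are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def hidden_layer(x, W, p_keep=0.5, train=True):
        h = 1.0 / (1.0 + np.exp(-(W @ x)))    # logistic units
        if train:
            # drop each unit independently; scaling by p_keep keeps the
            # expected activation equal to the all-units test-time pass
            mask = rng.random(h.shape) < p_keep
            return h * mask / p_keep
        return h                              # test time: average over all 'models'

    W = rng.normal(0, 0.1, (16, 8))
    x = rng.normal(size=8)
    h_train = hidden_layer(x, W, train=True)  # roughly half the units zeroed
    h_test = hidden_layer(x, W, train=False)  # all units active

    # the binary-neuron observation: a unit that fires 0/1 with probability p
    # carries p in expectation, much like sending the real value p
    p = 0.7
    spikes = rng.random(100_000) < p
    print(spikes.mean())                      # ~0.7, an unbiased estimate of p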

{1336}
ref: -0 tags: google glucose sensing contact lens date: 04-28-2016 19:41 gmt

A contact lens with embedded sensor for monitoring tear glucose level

  • PMID-21257302
  • Metal stack: Ti 10 nm / Pd 10 nm / Pt 100 nm,
  • on a 100 µm thick PET film.
  • A 30 µL glucose oxidase solution (10 mg/mL) was dropped onto the sensor area.
  • Then the sensor was suspended vertically above a titanium isopropoxide solution in a sealed dish for 6 h to create a GOD/titania sol-gel membrane, just as reported (Yu et al., 2003).
  • After forming the sol-gel membrane, a 30 µL aliquot of Nafion® solution was dropped onto the same area of the sensor, and allowed to dry in air for about 20 min.
  • Yet the interference rejection from Nafion is imperfect: at 100 µM concentrations, glucose is indistinguishable from ascorbic acid + lactate + urea.
  • Sensor drifts to 55% of its original performance after 4 days (figure 6).
    • sensor was stored in a buffer @ 4C.
    • Probably OK for contact lenses, though.

{1173}
ref: -0 tags: Moshe looks automatic programming google tech talk links date: 11-07-2012 07:38 gmt

List of links from Moshe Looks google tech talk:

{723}
ref: notes-0 tags: data effectiveness Norvig google statistics machine learning date: 12-06-2011 07:15 gmt

The unreasonable effectiveness of data.

  • counterpoint to Eugene Wigner's "The Unreasonable effectiveness of mathematics in the natural sciences"
    • that is, math is not so effective in fields that involve people.
    • we should not look for elegant theories, rather embrace complexity and make use of extensive data. (google's mantra!!)
  • in 2006 google released a trillion-word corpus, with counts for all word sequences (n-grams) up to 5 words long.
  • document translation and voice transcription are successful mostly because people need the services - there is demand.
    • Traditional natural language processing does not have such demand as of yet. Furthermore, it has required human-annotated data, which is expensive to produce.
  • simple models and a lot of data trump more elaborate models based on less data.
    • for translation and any other application of ML to web data, n-gram models or linear classifiers work better than elaborate models that try to discover general rules (see the sketch after this list).
  • much web data consists of individually rare but collectively frequent events.
  • because of a huge shared cognitive and cultural context, linguistic expression can be highly ambiguous and still often be understood correctly.
  • mention project halo - $10,000 per page of a chemistry textbook. (funded by DARPA)
  • ultimately suggests that there is so much to explore now: just use unlabeled data with an unsupervised learning algorithm.
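
A toy count-based bigram model of the kind the essay favors: no hand-built rules, just counts over data, which works better the more data you have. The corpus here is an obvious stand-in; real systems add smoothing and back-off.

    from collections import Counter, defaultdict

    # estimate P(next | prev) purely from bigram counts
    corpus = "the cat sat on the mat the cat ate the rat".split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def predict(prev):
        # most frequent continuation seen in the data
        return counts[prev].most_common(1)[0][0]

    print(predict("the"))   # -> 'cat' (seen twice, vs 'mat'/'rat' once each)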

{581}
ref: notes-0 tags: android google date: 07-09-2008 03:33 gmt

brilliant!! source: android winners

{4}
ref: bookmark-0 tags: google parallel_computing GFS algorithm mapping reducing date: 0-0-2006 0:0

http://labs.google.com/papers/mapreduce.html