{623} revision 5 modified: 01-03-2012 02:31 gmt

Reinforcement learning in the cortex (a web scour/crawl):

  • http://www.springerlink.com/content/v211201413228x34/
    • short/long interspike intervals via pain reinforcement in immobilized rabbits.
  • PMID-3748636 Increased regularity of activity of cortical neurons in learning due to disinhibitory effect of reinforcement.
    • more rabbit shocking.
  • http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6T0F-3S1PT00-P
    • applied glutamate & noradrenaline; both responses are complex.
  • Reinforcement learning in populations of spiking neurons
    • the result: reinforcement learning can function effectively in large populations of neurons if there is a trace of the population activity in addition to the reinforcement signal. this trace must be per-synapse or perhaps per-neuron (as has been anticipated for some time). very important result; it helps with the 'specificity' problem.
    • in human terms, the standard reinforcement learning approach is analogous to having a class of students write an exam and being informed by the teacher on the next day whether the majority of students passed or not.
    • this learning method is slow and achieves limited fidelity; in contrast, behavioral reinforcement learning can be reliable and fast. (perhaps this is a result of already-existing maps and/or activity in the cortex?)
    • reinforcement learning is almost the opposite of backpropagation: in backprop, an error signal is computed per neuron, while in reinforcement learning the error is only computed for the entire system. They posit that there must be a middle ground (need something less than one neuron to compute the training/error signal per neuron, otherwise the system would not be very efficient...)
    • makes a good if obvious point: to learn from trial and error, different responses to a given stimulus must be explored, and, for this, randomness in the neural activities provides a convenient mechanism.
    • they use the running mean as an eligibility trace per synapse. then change in weight = eta * eligibility trace(t), evaluated at the ends of trials.
    • implemented an asymmetric rule that updates the synapses only slightly if the output is reliable and correct.
    • also needed a population signal or fed-back version of the previous neural behavior. Then individual reinforcement is a product of the reinforcement signal * the population signal * the eligibility trace (the last per synapse). Roughly, if the population signal differs from the eligibility trace, and the behavior is wrong, then that synapse should be reinforced, and vice versa.
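
A minimal sketch of the update described in the notes above: a running mean kept per synapse as the eligibility trace, and an end-of-trial weight change equal to eta * reinforcement signal * population signal * eligibility trace. The function names, the choice of quantity being averaged, the scalar encoding of the signals, and the sign conventions are all assumptions for illustration; this is not the paper's exact rule.

```python
def running_mean_trace(samples):
    """Running mean of a per-synapse quantity over one trial.

    What exactly is averaged is an assumption here (for example, a
    per-time-step pre/post coincidence term); the notes only say that
    a running mean serves as the eligibility trace.
    """
    trace = 0.0
    for n, x in enumerate(samples, start=1):
        trace += (x - trace) / n  # incremental mean update
    return trace


def update_weights(weights, eligibility, reward, population_signal, eta=0.1):
    """End-of-trial three-factor update (hypothetical helper).

    delta_w_i = eta * reward * population_signal * eligibility_i
    reward and population_signal are global scalars shared by all
    synapses; only the eligibility trace is synapse-specific.
    """
    return [w + eta * reward * population_signal * e
            for w, e in zip(weights, eligibility)]


# One trial with two synapses: compute traces, then apply the update once.
traces = [running_mean_trace([1.0, 2.0, 3.0]),
          running_mean_trace([-1.0, -1.0, -1.0])]
new_w = update_weights([0.0, 1.0], traces, reward=-1.0,
                       population_signal=1.0, eta=0.5)
```

The point of the sketch is the specificity argument: the two global factors are identical for every synapse, so the per-synapse trace is the only term that makes the weight changes differ.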
  • PMID-17444757 Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity.
    • seems to give about the same result as above, except with STDP: reinforcement-modulated STDP with an eligibility trace stored at each synapse permits learning even if a reward signal is delayed.
    • network can learn XOR problem with firing-rate or temporally coded input.
    • they want someone to look for reward-modulated STDP. paper came out June 2007.
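
A rough sketch of why a per-synapse eligibility trace lets reward-modulated STDP cope with a delayed reward: spike pairings write into a slowly decaying trace, and the weight only moves when the reward arrives. The exponential STDP window, the time constants, the 1 ms step, the unit reward magnitude, and every name below are assumptions for illustration, not the model from the paper.

```python
import math


def stdp_window(dt, a_plus=1.0, a_minus=1.0, tau=20.0):
    """Assumed exponential STDP kernel; dt = t_post - t_pre in ms.

    Pre-before-post (dt > 0) gives a positive value, post-before-pre
    (dt < 0) a negative one.
    """
    if dt > 0:
        return a_plus * math.exp(-dt / tau)
    if dt < 0:
        return -a_minus * math.exp(dt / tau)
    return 0.0


def simulate(pre_spikes, post_spikes, reward_times, t_end,
             tau_e=500.0, eta=0.01, w0=0.5):
    """Single synapse, 1 ms steps, spike and reward times as integer ms.

    Each spike pairing adds its STDP value to the eligibility trace at
    the time of the later spike. The trace decays with time constant
    tau_e. The weight changes only at reward times, by eta * trace.
    """
    w, trace = w0, 0.0
    decay = math.exp(-1.0 / tau_e)
    pre, post, rewards = set(pre_spikes), set(post_spikes), set(reward_times)
    for t in range(t_end):
        trace *= decay
        if t in post:  # pair this post spike with earlier pre spikes
            trace += sum(stdp_window(t - tp) for tp in pre_spikes if tp < t)
        if t in pre:   # pair this pre spike with earlier post spikes
            trace += sum(stdp_window(tp - t) for tp in post_spikes if tp < t)
        if t in rewards:
            w += eta * trace  # reward of magnitude 1 gates the trace
    return w


# Pre fires 5 ms before post; reward arrives about 185 ms later.
w_causal = simulate([10], [15], [200], t_end=300)
# Same pairing, but no reward is ever delivered.
w_no_reward = simulate([10], [15], [], t_end=300)
```

With these assumed parameters the trace is still nonzero when the reward arrives, so the causal pairing is strengthened, while the unrewarded run leaves the weight at its initial value.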
  • PMID: Metaplasticity: the plasticity of synaptic plasticity (1996, Mark Bear)
    • there is such a thing as metaplasticity! (plasticity of plasticity, or control over how effective NMDAR are..)
    • he has several other papers on this topic after this..
  • PMID-2682404 Reward or reinforcement: what's the difference? (1989)
    • reward = certain environmental stimuli have the effect of eliciting approach responses. ventral striatum / nucleus accumbens is instrumental for this.
    • reinforcement = the tendency of certain stimuli to strengthen stimulus-response tendencies. dorsolateral striatum is used here.
  • PMID-9463469 Rapid plasticity of human cortical movement representation induced by practice.
    • used TMS to evoke isolated and directionally consistent thumb movements.
    • then asked the volunteers to practice moving their thumbs in the opposite direction.
    • after 5-30 minutes of practice, TMS evoked a response in the practiced direction. wow! this may be short-term memory or the first step in skill learning.
  • PMID-12736341 Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity.
    • temporally asymmetric plasticity is apparently required for a stable network (aka no epilepsy?), and can be optimized to represent the temporal structure of input correlations.