ref: -0 tags: diffusion models image generation OpenAI date: 12-24-2021 05:50 gmt revision:0 [head]

Some investigations into denoising models & their intellectual lineage:

Deep Unsupervised Learning using Nonequilibrium Thermodynamics 2015

  • Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
  • Starting derivation of using diffusion models for training.
  • Verrry roughly, the idea is to destroy the structure in an image with per-pixel diagonal Gaussian noise, and train an inverse-diffusion model to remove the noise at each step. Then start with Gaussian noise and reverse-diffuse an image.
  • Diffusion can take 100s - 1000s of steps; steps are made small to preserve the assumption that the conditional probability p(x_{t-1}|x_t) \propto N(0, I)
    • The time variable here goes from 0 (uncorrupted data) to T (fully corrupted / Gaussian noise)
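
The forward corruption process above can be sketched in a few lines (a toy 1-D "image" and hypothetical function names; the per-step noise variances beta_t are assumed small):

```python
import numpy as np

def forward_diffuse(x0, betas, rng):
    """Corrupt a sample step by step with per-pixel Gaussian noise;
    betas[t] is the (small) noise variance at step t."""
    x = x0.copy()
    trajectory = [x.copy()]
    for beta in betas:
        # scale the mean down so total variance stays bounded
        # (otherwise the chain is a pure random walk and blows up)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        trajectory.append(x.copy())
    return trajectory

rng = np.random.default_rng(0)
x0 = np.linspace(-1.0, 1.0, 8)        # toy 1-D "image"
betas = np.full(1000, 1e-3)           # many small steps, t = 0 .. T
traj = forward_diffuse(x0, betas, rng)
# after T steps the signal is attenuated by sqrt(prod(1 - beta)) ~ 0.61
# and the sample is mostly Gaussian noise
```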

Generative Modeling by Estimating Gradients of the Data Distribution July 2019

  • Yang Song, Stefano Ermon

Denoising Diffusion Probabilistic Models June 2020

  • Jonathan Ho, Ajay Jain, Pieter Abbeel
  • A diffusion model that can output 'realistic' images (low FID / low negative log-likelihood)

Improved Denoising Diffusion Probabilistic Models Feb 2021

  • Alex Nichol, Prafulla Dhariwal
  • This is directly based on Ho 2020 and Sohl-Dickstein 2015, but with tweaks
  • The objective is no longer the log-likelihood of the data given the parameters (per pixel); it's now mostly the MSE between the corrupting noise (which is known) and the estimated noise.
  • That is, the neural network model attempts, given x_t , to estimate the noise which corrupted it, which can then be used to produce x_{t-1}
    • Simplicity. Satisfying.
  • They also include a reweighted version of the log-likelihood loss, which puts more emphasis on the first few steps of noising. These steps are more important for NLL; reweighting also smooths the loss.
    • I think that, per Ho above, the simple MSE loss is sufficient to generate good images, but the reweighted LL improves the likelihood of the parameters.
  • There are some good crunchy mathematical details on how exactly the mean and variance of the estimated Gaussian distributions are handled -- at each noising step, you need to scale the mean down to prevent Brownian / random-walk drift.
    • Taking these further, you can estimate an image at any point t in the forward diffusion chain. They use this fact to optimize the function approximator (a neural network; more later) using a (random but re-weighted/scheduled) t and the LL loss + simple loss.
  • Ho 2020 above treats the variance of the noising Gaussian as fixed -- that is, \beta ; this paper improves the likelihood by adjusting the noise variance, mostly at the last steps, via \tilde \beta_t , and then further allowing the function approximator to tune the variance (a multiplicative factor) per inverse-diffusion timestep.
    • TBH I'm still slightly foggy on how you go from estimating noise (this seems like samples, concrete) to then estimating variance (which is variational?). hmm.
  • Finally, they schedule the forward noising with a cosine^2, rather than a linear ramp. This makes the last phases of corruption more useful.
  • Because they have an explicit parameterization of the noise variance, they can run the inverse diffusion (e.g. image generation) faster -- rather than 4000 steps, which can take a few minutes on a GPU, they can step up the variance and run it for only 50 steps and get nearly as good images.
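
The two pieces above -- the cosine^2 noise schedule and the simple MSE-on-noise objective -- can be roughly sketched, assuming the alpha-bar (cumulative signal fraction) parameterization from the paper (function names are mine):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cosine^2 schedule for the cumulative signal fraction alpha_bar(t),
    following Nichol & Dhariwal 2021; s is a small offset."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped for stability
    return np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, max_beta)

def simple_loss(eps_true, eps_pred):
    """The 'simple' objective: MSE between the known corrupting noise
    and the network's estimate of it."""
    return np.mean((eps_true - eps_pred) ** 2)

ab = cosine_alpha_bar(1000)
betas = betas_from_alpha_bar(ab)
# betas start tiny and grow toward the end of the chain, making the
# last phases of corruption more gradual / useful than a linear ramp
```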

Diffusion Models Beat GANs on Image Synthesis May 2021

  • Prafulla Dhariwal, Alex Nichol

In all of the above, it seems that the inverse-diffusion function approximator is a minor player in the paper -- but of course, it's vitally important to making the system work. In some sense, this 'diffusion model' is as much a means of training the neural network as it is a (rather inefficient, compared to GANs) way of sampling from the data distribution. In Nichol & Dhariwal Feb 2021, they use a U-net convolutional network (e.g. start with few channels, downsample and double the channels until there are 128-256 channels, then upsample x2 and halve the channels), including multi-headed attention. Ho 2020 used single-headed attention only at the 16x16 level. Ho 2020 in turn was based on PixelCNN++
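
The channel bookkeeping described can be sketched as follows (a hypothetical toy plan, not the paper's exact architecture):

```python
def unet_channel_plan(base=64, depth=3):
    """Channel counts for a toy U-net: double the channels on each
    downsampling stage, then mirror (halve) on the upsampling path.
    Skip connections pair each encoder stage with its decoder mirror."""
    down = [base * 2 ** i for i in range(depth + 1)]  # encoder: 64,128,256,512
    up = down[-2::-1]                                 # decoder: 256,128,64
    return down, up

down, up = unet_channel_plan(base=64, depth=3)
```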

PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications Jan 2017

  • Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma

which in turn was an improvement to (e.g. by adding self-attention layers)

Conditional Image Generation with PixelCNN Decoders

  • Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu

Most recently,

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  • Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen

Added text-conditional generation + many more parameters + much more compute to yield very impressive image results + in-painting. This last effect is enabled by the fact that it's a full generative denoising probabilistic model -- you can condition on other parts of the image!

ref: -0 tags: curiosity exploration forward inverse models trevor darrell date: 02-01-2019 03:42 gmt revision:1 [0] [head]

Curiosity-driven exploration by Self-supervised prediction

  • Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, Trevor Darrell
  • Key insight: “we only predict the changes in the environment that could possibly be due to actions of our agent or affect the agent, and ignore the rest”.
    • Instead of making predictions in the sensory space (e.g. pixels), we transform the sensory input into a feature space where only the information relevant to the agent is represented.
    • We learn this feature space using self-supervision -- training a neural network via a proxy inverse dynamics task -- predicting the agent’s action from the past and future sensory states.
  • We then use this inverse model to train a forward dynamics model to predict feature representation of the next state from present feature representation and action.
      • The difference between expected and actual representation serves as a reward signal for the agent.
  • Quasi actor-critic / adversarial agent design, again.
  • Used the asynchronous advantage actor critic policy gradient method (Mnih et al 2016 Asynchronous Methods for Deep Reinforcement Learning).
  • Compare with variational information maximization (VIME) trained with TRPO (Trust region policy optimization) which is “more sample efficient than A3C but takes more wall time”.
  • References / concurrent work: Several methods propose improving data efficiency of RL algorithms using self-supervised prediction based auxiliary tasks (Jaderberg et al., 2017; Shelhamer et al., 2017).
  • An interesting direction for future research is to use the learned exploration behavior / skill as a motor primitive / low level policy in a more complex, hierarchical system. For example, the skill of walking along corridors could be used as part of a navigation system.
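
The intrinsic reward described above -- forward-model prediction error in the learned feature space -- might be sketched with linear stand-ins for the networks (all weights, shapes, and names here are hypothetical placeholders, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear stand-ins: phi maps raw observations to a feature space; the
# forward model predicts phi(s_{t+1}) from (phi(s_t), a_t).  In the paper
# phi is trained via the inverse-dynamics task (predict a_t from states).
W_phi = rng.standard_normal((4, 16))      # feature encoder: 16-dim obs -> 4-dim
W_fwd = rng.standard_normal((4 + 2, 4))   # forward dynamics model

def phi(s):
    return s @ W_phi.T

def intrinsic_reward(s_t, a_t, s_next):
    """Curiosity reward = squared error of the forward model in feature space."""
    pred = np.concatenate([phi(s_t), a_t]) @ W_fwd
    return float(np.sum((pred - phi(s_next)) ** 2))

s_t = rng.standard_normal(16)
s_next = rng.standard_normal(16)
a_t = np.array([1.0, 0.0])
r_i = intrinsic_reward(s_t, a_t, s_next)  # nonnegative scalar reward
```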

ref: tlh24-2011 tags: motor learning models BMI date: 01-06-2012 00:19 gmt revision:1 [0] [head]

Experiment: you have a key. You want that key to learn to control a BMI, but you do not want the BMI to learn how the key does things, as

  1. That is not applicable when you don't have training data - amputees, paraplegics.
  2. That does not tell much about motor learning, which is what we are interested in.

Given this, I propose a very simple groupweight: one axis is controlled by the summed action of a certain population of neurons, the other by a second, disjoint, population; a third population serves as control. The task of the key is to figure out what does what: how does the firing of a given unit translate to movement (forward model). Then the task during actual behavior is to invert this: given movement end, what sequence of firings should be generated? I assume, for now, that the brain has inbuilt mechanisms for inverting models (not that it isn't incredibly interesting -- and I'll venture a guess that it's related to replay, perhaps backwards replay of events). This leaves us with the task of inferring the tool-model from behavior, a task that can be done now with our modern (though here-mentioned quite simple) machine learning algorithms. Specifically, it can be done through supervised learning: we know the input (neural firing rates) and the output (cursor motion), and need to learn the transform between them. I can think of many ways of doing this on a computer:

  1. Linear regression -- This is obvious given the problem statement and knowledge that the model is inherently linear and separable (no multiplication factors between the input vectors). In matlab, you'd just do mldivide (the backslash operator) -- but! this requires storing all behavior to date. Does the brain do this? I doubt it, but this model, for a linear BMI, is optimal. (You could extend it to be Bayesian if you want confidence intervals -- but this won't make it faster).
  2. Gradient descent -- During online performance, you (or the brain) adjust the estimates of the weights per neuron to minimize error between observed behavior and estimated behavior (the estimated behavior would constitute a forward model). This is just LMS; it works, but has exponential convergence and may get stuck in local minima. This model will make predictions on which neurons change relevance in the behavior (more needed for acquiring reward) based on continuous-time updates.
  3. Batched gradient descent -- Hypothetically, one could bolster the learning rate by running batches of data multiple times through a gradient descent algorithm. The brain very well could do this offline (during sleep), and we can observe this. Such a mechanism would improve performance after sleep, which has been observed behaviorally in people (and primates?).
  4. Gated Gradient Descent -- This is halfway between reinforcement learning and gradient descent. Basically, the brain only updates weights when something of motivational / sensory salience occurs, e.g. juice reward. It differs from raw reinforcement learning in that there is still multiplication between sensory and motor data + subsequent derivative.
  5. Reinforcement learning -- Neurons are 'rewarded' at the instant juice is delivered; they adjust their behavior based on behavioral context (a target), which presumably (given how long we train our keys), is present in the brain at the same time the cursor enters the target. Sensory data and model-building are largely absent.
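
Options 1 and 2 can be sketched side by side on synthetic data (numpy in place of matlab; all names and dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_samples = 10, 500
W_true = rng.standard_normal((n_units, 2))     # true unit-to-cursor map
rates = rng.standard_normal((n_samples, n_units))
cursor = rates @ W_true + 0.01 * rng.standard_normal((n_samples, 2))

# 1. Batch linear regression (matlab's backslash): requires storing all data.
W_ls, *_ = np.linalg.lstsq(rates, cursor, rcond=None)

# 2. Online LMS gradient descent: one update per observation, no storage.
W_lms = np.zeros((n_units, 2))
lr = 0.01
for x, y in zip(rates, cursor):
    err = y - x @ W_lms                        # prediction error (forward model)
    W_lms += lr * np.outer(x, err)             # delta rule / LMS update

# both recover W_true; LMS converges exponentially and more slowly
```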

{i need to think more about model-building, model inversion, and songbird learning?}

ref: -0 tags: Todorov motor control models 2000 date: 12-22-2011 21:18 gmt revision:3 [2] [1] [0] [head]

PMID-10725930 Direct cortical control of muscle activation in voluntary arm movements: a model.

  • Argues that the observed high-level control of parameters (movement direction) is inconsistent with demonstrated low-level control (control of individual muscles / muscle groups, as revealed by STA [5] or force production [3]), but this inconsistency is false: the principle of low level control is correct, and high level control appears due to properties of the musculoskeletal system.
  • "Yet the same cells that encode hand velocity in movement tasks can also encode the forces exerted against external objects in both movement and isometric tasks [9,10].
  • The following other correlations have been observed:
    • arm position [11]
    • acceleration [12]
    • movement preparation [13]
    • target position [14]
    • distance to target [15]
    • overall trajectory [16]
    • muscle coactivation [17]
    • serial order [18]
    • visual target position [19]
    • joint configuration [20]
    • instantaneous movement curvature [7]
    • time from movement onset [15]
  • although these models can fit the data well, they leave a crucial question unanswered, namely, how such a mixed signal can be useful for generating motor behavior.
    • What? No! The diversity of voices gives rise to robust, dynamic computation. I think this is what Miguel has written about, will need to find a reference.
  • Anyway, all the motor parameters are related by the laws of physics -- the actual dimensionality of real reaches is relatively low.
  • His model: muscle activity simply reflects M1 PTN activity.
  • If you include real muscle parameters, a lot of the observed correlations make sense: muscle force depends not only on activation, but also on muscle length and rate of change of length.
  • In this scientific problem, the output (motor behavior) specified by the motor task is easily measured, and the input (M1 firing) must be explained.
    • Due to the many-to-one mapping, there is a large null-space of the inverse transform, so individual neurons cannot be predicted. Hence focus on population vector average.
  • Cosine tuning is the only activation pattern that minimizes neuromotor noise (derived in methods via Parseval's theorem). Hence he uses force, velocity, and displacement tuning for his M1 cells.
  • Activity of M1 cells is constrained in endpoint space, hence depends only on behavioral parameters.
    • The muscles were "integrated out".
  • Using his equation, it is clear that for an isometric task, M1 activity is cosine tuned to force direction and magnitude -- x(t) is constant.
  • For hand kinematics in the physiological range with an experimentally measured inertia-to-damping ratio, the damping compensation signal dominates the acceleration signal.
    • Hence population activity \propto \dot x(t)
    • Muscle damping is asymmetric: predominant during shortening.
  • The population vector ... is equal not to the movement direction or velocity, but instead to the particular sum of position, velocity, acceleration, and force signals in eq. 1
  • PV reconstruction fails when movement and force direction are varied independently. [28]
  • Fig 4. Schwartz' drawing task -- {951} -- and shows how curvature, naturalistic velocity profiles, the resultant accelerations, and leading neuronal firing interact to distort the decoded PV.
    • Explains why, when assuming PV tuning, there seems to be a variable M1-to-movement delay. At high curvature the PV can apparently lag movement. Impossible!
  • Fig 5 reproduces [21]
    • Mean firing rate (mfr, used to derive the Poisson-process spike times) and r^2-based classification are remarkably different -- smoothing + square root biases toward finding direction-tuned cells.
    • Plus, as P, V, and A are all linearly related, a sum of the 3 is closer to D than any of the three.
    • "Such biases raise the important question of how one can determine what an individual neuron controls"
  • PV reversals occur when the force/acceleration term exceeds the velocity scaling term -- which is 'equivalent' to the triphasic burst pattern observed in EMG. Ergo monkeys should be trained to make faster movements.
  • The structure of your model -- for example firingrate = b_0 + b_x X + b_y Y + b_m M -- biases analysis toward direction, not magnitude; the correct model is firingrate = b_0 + b_{xm} X M + b_{ym} Y M -- multiplicative.
  • "Most of these puzzling phenomena arise from the feedforward control of muscle viscoelasticity."
  • Implicit assumption is that for the simple, overtrained, unperturbed movements typically studied, feedforward neural control is quite accurate. When you get spinal reflexes involved things may change. Likewise for projections from the red nucleus.
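
The additive-vs-multiplicative point can be demonstrated numerically: at a fixed movement magnitude, rates generated by the multiplicative model are fit perfectly by the additive direction-only model, so the two are indistinguishable unless magnitude is varied (a toy sketch with made-up coefficients, not Todorov's actual simulation):

```python
import numpy as np

# Generate rates from the multiplicative model f = b0 + b_xm*X*M + b_ym*Y*M
theta = np.linspace(0, 2 * np.pi, 16, endpoint=False)
M = 2.0                                   # fixed movement magnitude
X, Y = np.cos(theta), np.sin(theta)
b0, b_xm, b_ym = 10.0, 3.0, 1.5
rate = b0 + b_xm * X * M + b_ym * Y * M   # multiplicative ground truth

# Additive fit: rate ~ b0 + bx*X + by*Y, magnitude silently folded into gains
A = np.column_stack([np.ones_like(X), X, Y])
coef, *_ = np.linalg.lstsq(A, rate, rcond=None)
# the fit is exact at fixed M (coef = [b0, b_xm*M, b_ym*M]), so direction
# tuning alone cannot distinguish the two models -- magnitude must vary
```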

ref: bookmark-0 tags: spiking neuron models learning SRM spike response model date: 0-0-2006 0:0 revision:0 [head]