{1559} revision 0 modified: 12-24-2021 05:50 gmt

Some investigations into denoising models & their intellectual lineage:

Deep Unsupervised Learning using Nonequilibrium Thermodynamics 2015

  • Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
  • Starting derivation of using diffusion models for training.
  • Verrry roughly, the idea is to destroy the structure in an image by adding diagonal (per-pixel) Gaussian noise, and train an inverse-diffusion model to remove the noise at each step. Then start with Gaussian noise and reverse-diffuse an image.
  • Diffusion can take 100s - 1000s of steps; steps are made small to preserve the assumption that the conditional probability $p(x_{t-1}|x_t) \propto N(0, I)$
    • The time variable here goes from 0 (uncorrupted data) to T (fully corrupted / Gaussian noise)
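
A minimal NumPy sketch of the forward (noising) chain described above, under stated assumptions: the linear variance schedule (1e-4 to 0.02 over 1000 steps) is borrowed from Ho 2020 rather than from this paper, and `forward_diffuse` plus the toy gradient "image" are hypothetical names / data for illustration only.

```python
import numpy as np

def forward_diffuse(x0, betas, rng):
    """Apply the forward noising chain step by step; return [x_0, ..., x_T]."""
    xs = [x0]
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)  # diagonal (per-pixel) Gaussian
        # Shrink the mean by sqrt(1 - beta) so the variance stays bounded
        # instead of growing like an unscaled random walk.
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        xs.append(x)
    return xs

rng = np.random.default_rng(0)
# Toy stand-in for an image: a 64x64 horizontal gradient in [-1, 1].
x0 = np.tile(np.linspace(-1.0, 1.0, 64), (64, 1))
betas = np.linspace(1e-4, 0.02, 1000)  # assumed schedule, small steps

xs = forward_diffuse(x0, betas, rng)
x_T = xs[-1]
# Correlation between the clean image and the fully corrupted one.
corr = np.corrcoef(x0.ravel(), x_T.ravel())[0, 1]
print(f"std of x_T: {x_T.std():.3f}, correlation with x_0: {corr:.3f}")
```

At t = T the result should look like unit-variance noise with essentially no trace of the original gradient; the reverse model's job is to undo this one small step at a time.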

Generative Modeling by Estimating Gradients of the Data Distribution July 2019

  • Yang Song, Stefano Ermon

Denoising Diffusion Probabilistic Models June 2020

  • Jonathan Ho, Ajay Jain, Pieter Abbeel
  • A diffusion model that can output 'realistic' images (low FID, though the log-likelihood is not competitive with other likelihood-based models)

Improved Denoising Diffusion Probabilistic Models Feb 2021

  • Alex Nichol, Prafulla Dhariwal
  • This is directly based on Ho 2020 and Sohl-Dickstein 2015, but with tweaks
  • The objective is no longer the log-likelihood of the data given the parameters (per pixel); it's now mostly the MSE between the corrupting noise (which is known) and the estimated noise.
  • That is, the neural network model attempts, given $x_t$, to estimate the noise which corrupted it, which then can be used to produce $x_{t-1}$
    • Simplicity. Satisfying.
  • They also include a reweighted version of the log-likelihood loss, which puts more emphasis on the first few steps of noising. These steps are more important for NLL; reweighting also smooths the loss.
    • I think that, per Ho above, the simple MSE loss is sufficient to generate good images, but the reweighted LL improves the likelihood of the parameters.
  • There are some good crunchy mathematical details on how exactly the mean and variance of the estimated Gaussian distributions are handled -- at each noising step, you need to scale the mean down to prevent Brownian / random walk.
    • Taking these further, you can estimate an image at any point t in the forward diffusion chain. They use this fact to optimize the function approximator (a neural network; more later) using a (random but re-weighted/scheduled) t and the LL loss + simple loss.
  • Ho 2020 above treats the variance of the noising Gaussian as fixed -- that is, $\beta$; this paper improves the likelihood by adjusting the noise variance, mostly at the last steps, by a $\tilde\beta_t$, and then further allowing the function approximator to tune the variance (a multiplicative factor) per inverse-diffusion timestep.
    • TBH I'm still slightly foggy on how you go from estimating noise (this seems like samples, concrete) to then estimating variance (which is variational?). hmm.
  • Finally, they schedule the forward noising with a cosine^2, rather than a linear ramp. This makes the last phases of corruption more useful.
  • Because they have an explicit parameterization of the noise variance, they can run the inverse diffusion (e.g. image generation) faster -- rather than 4000 steps, which can take a few minutes on a GPU, they can step up the variance and run it only for 50 steps and get nearly as good images.
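
To make the bullets above concrete, here is a rough NumPy sketch of three pieces: the cosine^2 schedule, the closed-form jump from $x_0$ to $x_t$, and the simple MSE-on-noise loss. Assumptions, loudly: the offset `s = 0.008`, the 0.999 clip on beta, and the linear 1e-4 to 0.02 baseline are recalled from the papers, not stated in these notes; all function names are hypothetical; and `zero_predictor` is a placeholder for the U-net, not a real model.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal fraction alpha_bar_t for t = 0..T (cosine^2 schedule)."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Same quantity for a linear beta ramp (assumed Ho 2020 values)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Per-step noise variances implied by alpha_bar, clipped near t = T."""
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

def q_sample(x0, t, alpha_bar, noise):
    """Jump straight to x_t: scaled-down mean plus scaled noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def simple_loss(predict_noise, x0, t, alpha_bar, rng):
    """MSE between the known corrupting noise and the model's estimate."""
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bar, noise)
    return np.mean((noise - predict_noise(x_t, t)) ** 2)

def zero_predictor(x_t, t):
    """Placeholder for the neural network: always guesses 'no noise'."""
    return np.zeros_like(x_t)

T = 1000
ab_cos = cosine_alpha_bar(T)
ab_lin = linear_alpha_bar(T)
betas_cos = betas_from_alpha_bar(ab_cos)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(32, 32))  # toy stand-in for an image
t = int(rng.integers(1, T + 1))             # uniformly random timestep
loss = simple_loss(zero_predictor, x0, t, ab_cos, rng)
print(f"alpha_bar at T/2: cosine={ab_cos[T // 2]:.3f}, linear={ab_lin[T // 2]:.3f}")
print(f"simple loss with the zero predictor at t={t}: {loss:.3f}")
```

Halfway through the chain the cosine schedule keeps much more of the image than the linear ramp does, which is one way to read "makes the last phases of corruption more useful". The zero predictor's loss sits near 1 (the variance of the noise); a trained network would have to beat that.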

Diffusion Models Beat GANs on Image Synthesis May 2021

  • Prafulla Dhariwal, Alex Nichol

In all of the above, it seems that the inverse-diffusion function approximator is a minor player in the paper -- but of course, it's vitally important to making the system work. In some sense, this 'diffusion model' is as much a means of training the neural network as it is a (rather inefficient, compared to GANs) way of sampling from the data distribution. In Nichol & Dhariwal Feb 2021, they use a U-net convolutional network (e.g. start with few channels, downsample and double the channels until there are 128-256 channels, then upsample x2 and halve the channels) including multi-headed attention. Ho 2020 used single-headed attention only at the 16x16 level. Ho 2020 in turn was based on PixelCNN++

PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications Jan 2017

  • Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma

which is an improvement to (e.g. add self-attention layers)

Conditional Image Generation with PixelCNN Decoders

  • Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu

Most recently,

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  • Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen

Added text-conditional generation + many more parameters + much more compute to yield very impressive image results + in-painting. This last effect is enabled by the fact that it's a full generative denoising probabilistic model -- you can condition on other parts of the image!