{1559} revision 0 modified: 12-24-2021 05:50 gmt

Some investigations into denoising models & their intellectual lineage:

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli

Starting derivation of using diffusion models for training. Verrry roughly, the idea is to destroy the structure in an image by adding diagonal Gaussian noise per-pixel, and train an inverse-diffusion model to remove the noise at each step. Then start with Gaussian noise and reverse-diffuse an image. Diffusion can take 100s - 1000s of steps; steps are made small to preserve the assumption that the reverse conditional probability $p(x_{t-1}|x_t)$ is itself approximately Gaussian. The time variable here goes from 0 (uncorrupted data) to T (fully corrupted / Gaussian noise).

Yang Song, Stefano Ermon

Jonathan Ho, Ajay Jain, Pieter Abbeel

A diffusion model that can output 'realistic' images (low FID / low log-likelihood).

Alex Nichol, Prafulla Dhariwal

This is directly based on Ho 2020 and Sohl-Dickstein 2015, but with tweaks. The objective is no longer the log-likelihood of the data given the parameters (per pixel); it's now mostly the MSE between the corrupting noise (which is known) and the estimated noise. That is, the neural network model attempts, given $x_t$, to estimate the noise which corrupted it, which then can be used to produce $x_{t-1}$. Simplicity. Satisfying.

They also include a reweighted version of the log-likelihood loss, which puts more emphasis on the first few steps of noising. These steps are more important for NLL; reweighting also smooths the loss. I think that, per Ho above, the simple MSE loss is sufficient to generate good images, but the reweighted LL improves the likelihood of the parameters.

There are some good crunchy mathematical details on how exactly the mean and variance of the estimated Gaussian distributions are handled -- at each noising step, you need to scale the mean down to prevent a Brownian / random walk, where the variance would otherwise grow without bound.
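A minimal sketch of the mean-scaling point and the simple loss, assuming a single fixed per-step noise variance `beta` (the papers schedule it over time). All function names here are hypothetical and only for illustration; this is not code from any of the papers. Scaling the previous sample by sqrt(1 - beta) before adding noise keeps the per-pixel variance near 1, while adding noise without scaling lets it grow linearly, random-walk style.

```python
import math
import random


def forward_step(x_prev, beta, rng):
    """One noising step: shrink the mean by sqrt(1 - beta), then add noise of variance beta."""
    scale = math.sqrt(1.0 - beta)
    sigma = math.sqrt(beta)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]


def variance_after(steps, beta, var0, scale_mean=True):
    """Per-pixel variance after `steps` noising steps, tracked analytically."""
    var = var0
    for _ in range(steps):
        # With scaling: var <- (1 - beta) * var + beta, which has fixed point 1.
        # Without scaling: var <- var + beta, which grows without bound.
        var = ((1.0 - beta) * var if scale_mean else var) + beta
    return var


def simple_loss(true_noise, predicted_noise):
    """MSE between the known corrupting noise and the model's estimate of it."""
    n = len(true_noise)
    return sum((a - b) ** 2 for a, b in zip(true_noise, predicted_noise)) / n


# Illustrative numbers only: 1000 steps, beta = 0.01, data variance 0.25.
scaled = variance_after(1000, 0.01, var0=0.25, scale_mean=True)
unscaled = variance_after(1000, 0.01, var0=0.25, scale_mean=False)
print(f"scaled mean: {scaled:.4f}, unscaled: {unscaled:.4f}")
```

In a real training loop the `predicted_noise` argument would come from the network evaluated on $x_t$ and t; here it is just a list so the loss can be checked by hand.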
Taking these further, you can estimate an image at any point t in the forward diffusion chain. They use this fact to optimize the function approximator (a neural network; more later) using a (random but re-weighted/scheduled) t and the LL loss + simple loss.

Ho 2020 above treats the variance of the noising Gaussian as fixed -- that is, $\beta$; this paper improves the likelihood by adjusting the noise variance, mostly at the last steps, by a $\tilde{\beta}_t$, and then further allowing the function approximator to tune the variance (a multiplicative factor) per inverse-diffusion timestep. TBH I'm still slightly foggy on how you go from estimating noise (this seems like samples, concrete) to then estimating variance (which is variational?). Hmm.

Finally, they schedule the forward noising with a cosine^2, rather than a linear ramp. This makes the last phases of corruption more useful.

Because they have an explicit parameterization of the noise variance, they can run the inverse diffusion (i.e. image generation) faster -- rather than 4000 steps, which can take a few minutes on a GPU, they can step up the variance and run it only for 50 steps and get nearly as good images.

Prafulla Dhariwal, Alex Nichol

In all of the above, it seems that the inverse-diffusion function approximator is a minor player in the paper -- but of course, it's vitally important to making the system work. In some sense, this 'diffusion model' is as much a means of training the neural network as it is a (rather inefficient, compared to GANs) way of sampling from the data distribution.

In Nichol & Dhariwal Feb 2021, they use a U-net convolutional network (e.g. start with few channels, downsample and double the channels until there are 128-256 channels, then upsample x2 and halve the channels) including multi-headed attention. Ho 2020 used single-headed attention only at the 16x16 level. Ho 2020 in turn was based on PixelCNN++ (Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma), which is an improvement to (e.g. add self-attention layers) the earlier work of Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu.

Most recently, Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen added text-conditional generation + many more parameters + much more compute to yield very impressive image results + in-painting. This last effect is enabled by the fact that it's a full generative denoising probabilistic model -- you can condition on other parts of the image!
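Going back to the noise schedule and the jump to an arbitrary t in the forward chain described in these notes, here is a rough sketch of both. Everything specific in it is an assumption not taken from the notes: the function names are hypothetical, the cosine offset `s = 0.008` and the linear range 1e-4 to 0.02 are plugged in as plausible defaults, and `alpha_bar` denotes the cumulative product of (1 - beta) up to step t, i.e. the fraction of signal variance remaining.

```python
import math
import random


def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal fraction under a cosine^2 schedule (s is an assumed small offset)."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(T + 1)]


def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction when beta ramps linearly (assumed endpoints, T > 1)."""
    out = [1.0]
    for t in range(1, T + 1):
        beta = beta_start + (beta_end - beta_start) * (t - 1) / (T - 1)
        out.append(out[-1] * (1.0 - beta))
    return out


def sample_x_t(x0, alpha_bar_t, rng):
    """Jump straight to step t: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * v + b * e for v, e in zip(x0, noise)]
    # The noise is returned too, since it is the regression target for the simple loss.
    return x_t, noise


T = 1000
cos_ab = cosine_alpha_bar(T)
lin_ab = linear_alpha_bar(T)

# Compare how much signal survives halfway through the chain under each schedule.
print(f"alpha_bar at T/2: cosine {cos_ab[T // 2]:.3f}, linear {lin_ab[T // 2]:.3f}")

rng = random.Random(0)
x_mid, eps = sample_x_t([0.3, -0.7, 0.1], cos_ab[T // 2], rng)
```

Under these assumed settings the linear ramp has destroyed most of the signal by the midpoint, whereas the cosine schedule still retains a good share of it, which is one way to read the remark that the later phases of corruption become more useful.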