ref: -0 tags: diffusion models image generation OpenAI date: 12-24-2021 05:50 gmt revision:0 [head]

Some investigations into denoising models & their intellectual lineage:

Deep Unsupervised Learning using Nonequilibrium Thermodynamics 2015

  • Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
  • The founding derivation of using diffusion processes to train generative models.
  • Verrry roughly, the idea is to destroy the structure in an image with per-pixel diagonal Gaussian noise, and to train an inverse-diffusion model to remove the noise at each step. Then start from pure Gaussian noise and reverse-diffuse an image.
  • Diffusion can take 100s - 1000s of steps; steps are made small to preserve the assumption that the conditional probability p(x_{t-1}|x_t) \propto N(0, I)
    • The time variable here goes from 0 (uncorrupted data) to T (fully corrupted / Gaussian noise)
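The forward (noising) half of this can be sketched in a few lines of numpy -- a toy sketch on a 1-D stand-in for an image, with schedule values chosen purely for illustration:

```python
import numpy as np

def forward_diffuse(x0, betas, rng):
    """Forward diffusion: at each step, shrink the signal by sqrt(1 - beta_t)
    and add N(0, beta_t I) noise, so total variance stays bounded rather
    than growing as a random walk."""
    x = x0.copy()
    traj = [x.copy()]
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        traj.append(x.copy())
    return traj

rng = np.random.default_rng(0)
x0 = rng.standard_normal(10_000)       # stand-in for flattened unit-variance pixels
betas = np.linspace(1e-4, 0.05, 500)   # small steps keep each reverse conditional near-Gaussian
traj = forward_diffuse(x0, betas, rng)
```

By the last step the sample has essentially no memory of x0 (the surviving signal fraction is the product of (1 - beta_t), here ~4e-6), which is exactly the "fully corrupted" T end of the chain.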

Generative Modeling by Estimating Gradients of the Data Distribution July 2019

  • Yang Song, Stefano Ermon

Denoising Diffusion Probabilistic Models June 2020

  • Jonathan Ho, Ajay Jain, Pieter Abbeel
  • A diffusion model that can output 'realistic' images (low FID / low log-likelihood)

Improved Denoising Diffusion Probabilistic Models Feb 2021

  • Alex Nichol, Prafulla Dhariwal
  • This is directly based on Ho 2020 and Sohl-Dickstein 2015, but with tweaks
  • The objective is no longer the log-likelihood of the data given the parameters (per pixel); it's now mostly the MSE between the corrupting noise (which is known) and the estimated noise.
  • That is, given x_t, the neural network model attempts to estimate the noise which corrupted it; this estimate can then be used to produce x_{t-1}
    • Simplicity. Satisfying.
  • They also include a reweighted version of the log-likelihood loss, which puts more emphasis on the first few steps of noising. These steps are more important for NLL; reweighting also smooths the loss.
    • I think that, per Ho above, the simple MSE loss is sufficient to generate good images, but the reweighted LL improves the likelihood of the parameters.
  • There are some good crunchy mathematical details on how exactly the mean and variance of the estimated Gaussian distributions are handled -- at each noising step, you need to scale the mean down to prevent Brownian / random-walk growth of the variance.
    • Taking these further, you can estimate an image at any point t in the forward diffusion chain. They use this fact to optimize the function approximator (a neural network; more later) using a (random but re-weighted/scheduled) t and the LL loss + simple loss.
  • Ho 2020 above treats the variance of the noising Gaussian as fixed -- that is, \beta_t follows a fixed schedule; this paper improves the likelihood by adjusting the noise variance, mostly at the last steps, via \tilde{\beta}_t, and then further allowing the function approximator to tune the variance (a multiplicative interpolation factor) per inverse-diffusion timestep.
    • TBH I'm still slightly foggy on how you go from estimating noise (this seems like samples, concrete) to then estimating variance (which is variational?). hmm.
  • Finally, they schedule the forward noising with a cosine^2, rather than a linear ramp. This makes the last phases of corruption more useful.
  • Because they have an explicit parameterization of the noise variance, they can run the inverse diffusion (e.g. image generation) faster -- rather than 4000 steps, which can take a few minutes on a GPU, they can step up the variance and run it for only 50 steps and get nearly as good images.
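The cosine^2 schedule can be sketched as follows -- `alpha_bar` here is the cumulative product of (1 - \beta_t), i.e. the fraction of original signal surviving after t steps, and s = 0.008 is the paper's smoothing offset (the function name is mine):

```python
import math

def alpha_bar(t, T, s=0.008):
    """Cumulative surviving-signal fraction after t of T noising steps,
    per the Nichol & Dhariwal cosine^2 schedule."""
    f = lambda u: math.cos(((u / T) + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

T = 1000
# per-step noise variances; clipped at 0.999 as in the paper
betas = [min(1.0 - alpha_bar(t, T) / alpha_bar(t - 1, T), 0.999)
         for t in range(1, T + 1)]
```

Compared to a linear ramp, this keeps \beta_t small for longer, so the late corruption steps still remove information gradually rather than all at once.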

Diffusion Models Beat GANs on Image Synthesis May 2021

  • Prafulla Dhariwal, Alex Nichol

In all of the above, it seems that the inverse-diffusion function approximator is a minor player in the papers -- but of course, it's vitally important to making the system work. In some sense, this 'diffusion model' is as much a means of training the neural network as it is a (rather inefficient, compared to GANs) way of sampling from the data distribution. In Nichol & Dhariwal Feb 2021, they use a U-net convolutional network (e.g. start with a few channels, downsample and double the channels until there are 128-256 channels, then upsample 2x and halve the channels), including multi-headed attention. Ho 2020 used single-headed attention, and only at the 16x16 level. Ho 2020 in turn was based on PixelCNN++

PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications Jan 2017

  • Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma

which is an improvement (e.g. adding self-attention layers) to

Conditional Image Generation with PixelCNN Decoders

  • Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu

Most recently,

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  • Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen

Added text-conditional generation + many more parameters + much more compute to yield very impressive image results + in-painting. This last effect is enabled by the fact that it's a full generative denoising probabilistic model -- you can condition on other parts of the image!

ref: -0 tags: image registration optimization camera calibration sewing machine date: 07-15-2016 05:04 gmt revision:20 [19] [18] [17] [16] [15] [14] [head]

Recently I was tasked with converting from image coordinates to real world coordinates from stereoscopic cameras mounted to the end-effector of a robot. The end goal was to let the user (me!) click on points in the image, and have the robot record that position & ultimately move to it.

The overall strategy is to get a set of points in both image and RW coordinates, then fit some sort of model to the measured data. I began by printing a grid of (hopefully evenly-spaced and perpendicular) lines on a laser printer; spacing was ~1.1 mm. This grid was manually aligned to the axes of robot motion by moving the robot along one axis & checking that the lines did not jog.

The images were modeled as a grating with quadratic phase in the (u,v) texture coordinates:

p_h(u,v) = sin((a_h u/1000 + b_h v/1000 + c_h)v + d_h u + e_h v + f_h) + 0.97 (1)

p_v(u,v) = sin((a_v u/1000 + b_v v/1000 + c_v)u + d_v u + e_v v + f_v) + 0.97 (2)

I(u,v) = 16 p_h p_v / \sqrt{2 + 16 p_h^2 + 16 p_v^2} (3)

The 1000 was used to make the parameter search distribution more spherical; c_h, c_v were bias terms to seed the solver; 0.97 was a duty-cycle term fit by inspection to the image data; (3) is a modified sigmoid.
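Equations (1)-(3) are cheap to evaluate; a sketch (the parameter values below are arbitrary placeholders, not the fitted ones):

```python
import numpy as np

def grating(u, v, ph, pv):
    """Evaluate the grating model, eqs (1)-(3). ph and pv are the
    parameter 6-tuples (a, b, c, d, e, f) for the horizontal and
    vertical line families."""
    a, b, c, d, e, f = ph
    p_h = np.sin((a * u / 1000 + b * v / 1000 + c) * v + d * u + e * v + f) + 0.97
    a, b, c, d, e, f = pv
    p_v = np.sin((a * u / 1000 + b * v / 1000 + c) * u + d * u + e * v + f) + 0.97
    # eq (3): modified sigmoid squashing the product into an image intensity
    return 16 * p_h * p_v / np.sqrt(2 + 16 * p_h ** 2 + 16 * p_v ** 2)

# evaluate over the full 1280 x 720 image with placeholder parameters
u, v = np.meshgrid(np.arange(1280), np.arange(720))
img = grating(u, v, ph=(0.01, 0.01, 0.2, 0.0, 0.0, 0.0),
              pv=(0.01, 0.01, 0.2, 0.0, 0.0, 0.0))
```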

I(u,v) was then optimized over the parameters using a GPU-accelerated (CUDA) nonlinear stochastic optimization:

(a_h, b_h, d_h, e_h, f_h | a_v, b_v, d_v, e_v, f_v) = Argmin \sum_u \sum_v (I(u,v) - Img(u,v))^2 (4)

Optimization was carried out by drawing parameters from a normal distribution with a diagonal covariance matrix, set by inspection, with the mean iteratively set to the best solution; horizontal and vertical optimization steps were separable and carried out independently. Equation (4) was evaluated 18k times, and equation (3) 34 billion times, per frame. Hence the need for GPU acceleration.
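A toy single-threaded version of such a Gaussian random search (the real thing evaluated eq (3) on a GPU; the quadratic loss below is a hypothetical stand-in for the image MSE of eq (4)):

```python
import numpy as np

def stochastic_fit(loss, theta0, sigma, n_iter=2000, seed=0):
    """Gaussian random search: draw candidates from a diagonal-covariance
    normal centered on the best solution so far, and keep any improvement."""
    rng = np.random.default_rng(seed)
    best = np.asarray(theta0, dtype=float)
    best_loss = loss(best)
    for _ in range(n_iter):
        cand = best + sigma * rng.standard_normal(best.shape)
        l = loss(cand)
        if l < best_loss:
            best, best_loss = cand, l
    return best, best_loss

# toy quadratic bowl standing in for the per-frame image MSE
target = np.array([1.5, -0.3])
theta, final = stochastic_fit(lambda th: float(np.sum((th - target) ** 2)),
                              [0.0, 0.0], sigma=0.3)
```

The diagonal covariance (one sigma per parameter) is what the /1000 rescaling in eqs (1)-(2) makes reasonable: it spheres the search distribution.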

This yielded a set of 10 parameters (again, c_h and c_v were bias terms and kept constant) which modeled the data (e.g. grid lines) for each of the two cameras. This process was repeated every 0.1 mm from 0 - 20 mm height (z) above the target grid, resulting in a sampled function for each of the parameters, e.g. a_h(z). This required 13 trillion evaluations of equation (3).

Now, the task was to use this model to generate the forward and reverse transforms between image and world coordinates; I approached this by generating a data set of the grid intersections in both image and world coordinates. To start this process, the known image origin u_{origin}|_{z=0}, v_{origin}|_{z=0} was used to find the corresponding roots of the periodic auxiliary functions p_h, p_v:

\frac{3\pi}{2} + 2\pi n_h = a_h u v/1000 + b_h v^2/1000 + (c_h + e_h)v + d_h u + f_h (5)

\frac{3\pi}{2} + 2\pi n_v = a_v u^2/1000 + b_v u v/1000 + (c_v + d_v)u + e_v v + f_v (6)

Or ..

n_h = round((a_h u v/1000 + b_h v^2/1000 + (c_h + e_h)v + d_h u + f_h - \frac{3\pi}{2}) / (2\pi)) (7)

n_v = round((a_v u^2/1000 + b_v u v/1000 + (c_v + d_v)u + e_v v + f_v - \frac{3\pi}{2}) / (2\pi)) (8)

From this, we get variables n_{h,origin}|_{z=0} and n_{v,origin}|_{z=0}, which are the offsets that align the sine functions p_h, p_v with the physical origin. Now the reverse (world to image) transform was needed, for which a two-stage Newton scheme was used to solve equations (7) and (8) for (u,v). Note that these are equations of phase, not image intensity -- otherwise this direct method would not work!

First, the equations were linearized with three steps of (9-11) to get in the right ballpark:

u_0 = 640, v_0 = 360

n_h = n_{h,origin}|_z + [-30 .. 30], n_v = n_{v,origin}|_z + [-20 .. 20] (9)

B_i = {\left[ \begin{matrix} \frac{3\pi}{2} + 2\pi n_h - a_h u_i v_i/1000 - b_h v_i^2/1000 - f_h \\ \frac{3\pi}{2} + 2\pi n_v - a_v u_i^2/1000 - b_v u_i v_i/1000 - f_v \end{matrix} \right]} (10)

A_i = {\left[ \begin{matrix} d_h & c_h + e_h \\ c_v + d_v & e_v \end{matrix} \right]} and

{\left[ \begin{matrix} u_{i+1} \\ v_{i+1} \end{matrix} \right]} = mldivide(A_i, B_i) (11), where mldivide is the Matlab left-division operator.

Then three steps with the full Jacobian were made to attain accuracy:

J_i = {\left[ \begin{matrix} a_h v_i/1000 + d_h & a_h u_i/1000 + 2 b_h v_i/1000 + c_h + e_h \\ 2 a_v u_i/1000 + b_v v_i/1000 + c_v + d_v & b_v u_i/1000 + e_v \end{matrix} \right]} (12)

K_i = {\left[ \begin{matrix} a_h u_i v_i/1000 + b_h v_i^2/1000 + (c_h + e_h)v_i + d_h u_i + f_h - \frac{3\pi}{2} - 2\pi n_h \\ a_v u_i^2/1000 + b_v u_i v_i/1000 + (c_v + d_v)u_i + e_v v_i + f_v - \frac{3\pi}{2} - 2\pi n_v \end{matrix} \right]} (13)

{\left[ \begin{matrix} u_{i+1} \\ v_{i+1} \end{matrix} \right]} = {\left[ \begin{matrix} u_i \\ v_i \end{matrix} \right]} - J_i^{-1} K_i (14)

Solutions (u,v) were verified by plugging back into equations (7) and (8) & verifying that n_h, n_v came out the same. Inconsistent solutions were discarded; solutions outside the image space [0, 1280), [0, 720) were also discarded. The process (10) - (14) was repeated to tile the image space with grid intersections, as indicated in (9), and this was repeated for all z in (0 .. 0.1 .. 20), resulting in a large (74k points) dataset of (u, v, n_h, n_v, z), which was converted to full real-world coordinates based on the measured spacing of the grid lines, (u, v, x, y, z). Between individual z steps, n_{h,origin}, n_{v,origin} were re-estimated to minimize (for a current z'):

(u_{origin}|_{z'+0.1} - u_{origin}|_{z'})^2 + (v_{origin}|_{z'+0.1} - v_{origin}|_{z'})^2 (15)

with grid search, and the method of equations (9-14). This was required as the stochastic method used to find the original image model parameters was agnostic to phase, and so the phase (via the parameters f_h, f_v) could jump between individual z measurements (the origin did not move much between successive measurements, hence (15) fixed the jumps).
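The two-stage scheme above is just Newton's method on the phase equations; the core update (14) can be sketched generically, with a toy 2-D system standing in for the actual F and J of equations (7)-(8) and (12):

```python
import numpy as np

def newton2(F, J, x0, n_steps=3):
    """A few full-Jacobian Newton steps, as in eq (14):
    x_{i+1} = x_i - J(x_i)^{-1} F(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - np.linalg.solve(J(x), F(x))
    return x

# toy stand-in system with known root at (1, 1)
F = lambda x: np.array([x[0] ** 2 + x[1] - 2.0, x[0] - x[1]])
J = lambda x: np.array([[2.0 * x[0], 1.0], [1.0, -1.0]])
root = newton2(F, J, [2.0, 0.5], n_steps=6)
```

Note the linearized stage (10)-(11) in the text plays the role of a good initial guess here; the full-Jacobian steps then converge quadratically.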

To this dataset, a model was fit:

{\left[ \begin{matrix} u \\ v \end{matrix} \right]} = A {\left[ \begin{matrix} 1 & x & y & z & x'^2 & y'^2 & z'^2 & w^2 & x'y' & x'z' & y'z' & x'w & y'w & z'w \end{matrix} \right]} (16)

Where x' = \frac{x}{10}, y' = \frac{y}{10}, z' = \frac{z}{10}, and w = \frac{20}{20-z}. w was introduced as an auxiliary variable to assist in perspective mapping, a la computer graphics. Likewise, x, y, z were scaled so the quadratic nonlinearity better matched the data.

The model (16) was fit using regular linear regression over all rows of the validated dataset. This resulted in a second set of coefficients A for a model of world coordinates to image coordinates; again, the model was inverted using Newton's method (Jacobian omitted here!). These coefficients, one set per camera, were then integrated into the C++ program for displaying video, and the inverse mapping (using closed-form matrix inversion) was used to convert mouse clicks to real-world coordinates for robot motor control. Even with the relatively poor wide-FOV cameras employed, the method is accurate to \pm 50 \mu m, and precise to \pm 120 \mu m.
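Fitting the 2 x 14 coefficient matrix A of eq (16) is ordinary least squares; a sketch with synthetic data (the function names and ranges are mine, for illustration):

```python
import numpy as np

def design_row(x, y, z):
    """Feature row of model (16); w is the perspective auxiliary variable."""
    xp, yp, zp = x / 10.0, y / 10.0, z / 10.0
    w = 20.0 / (20.0 - z)
    return [1.0, x, y, z, xp * xp, yp * yp, zp * zp, w * w,
            xp * yp, xp * zp, yp * zp, xp * w, yp * w, zp * w]

def fit_camera_model(world, uv):
    """Least-squares fit of the 2 x 14 coefficient matrix A in eq (16)."""
    X = np.array([design_row(*p) for p in world])      # (N, 14)
    sol, *_ = np.linalg.lstsq(X, np.asarray(uv), rcond=None)
    return sol.T                                       # (2, 14)

# synthetic check: recover a known coefficient matrix from noiseless data
rng = np.random.default_rng(1)
world = np.c_[rng.uniform(-5, 5, (60, 2)), rng.uniform(0, 15, 60)]
A_true = rng.standard_normal((2, 14))
uv = np.array([A_true @ np.array(design_row(*p)) for p in world])
A_fit = fit_camera_model(world, uv)
```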

ref: -0 tags: hinton convolutional deep networks image recognition 2012 date: 01-11-2014 20:14 gmt revision:0 [head]

ImageNet Classification with Deep Convolutional Networks

ref: Debarnot-2009.03 tags: sleep motor imagery practice date: 03-24-2009 15:32 gmt revision:3 [2] [1] [0] [head]

PMID-18835655[0] Sleep-related improvements in motor learning following mental practice.

  • shows that after both physical practice and mental imagery on day 1, sleep improves test performance in both conditions when testing on day 2.


[0] Debarnot U, Creveaux T, Collet C, Gemignani A, Massarelli R, Doyon J, Guillot A. Sleep-related improvements in motor learning following mental practice. Brain Cogn 69:2, 398-405 (2009 Mar)

ref: Porro-1996.12 tags: motor imagery fMRI practice date: 02-19-2009 22:50 gmt revision:0 [head]

PMID-8922425 Primary Motor and Sensory Cortex Activation during Motor Performance and Motor Imagery: A Functional Magnetic Resonance Imaging Study.

  • says exactly what you might expect: that the motor cortex is active during motor imagery, and the regions active during motor performance and motor imagery are overlapping.

ref: bookmark-0 tags: thalamus basal ganglia neuroanatomy centromedian red nucleus images date: 0-0-2007 0:0 revision:0 [head]

http://www.neuroanatomy.wisc.edu/coro97/contents.htm --coronal sections through the thalamus, very nice!

ref: bookmark-0 tags: neuroanatomy pulvinar thalamus superior colliculus image gray brainstem date: 0-0-2007 0:0 revision:0 [head]

http://en.wikipedia.org/wiki/Image:Gray719.png --great, very useful!