 m8ta
{1546} ref: -1992 tags: Linsker infomax Hebbian anti-hebbian linear perceptron unsupervised learning date: 08-04-2021 00:20 gmt revision:2

Ralph Linsker, 1992. A development upon {1545} -- this time with lateral inhibition trained through noise-contrast and anti-Hebbian plasticity. {1545} does not perfectly maximize the mutual information between the input and output -- this allegedly requires the inverse of the covariance matrix, $Q$.

As before, infomax principles: maximize the mutual information $MI \propto H(Z) - H(Z | S)$, where Z is the network output and S is the signal input. (Note: the second term means minimizing the conditional entropy of the output given the input.) For a Gaussian variable, $H = \frac{1}{2} \ln \det Q$ (up to an additive constant), where Q is the covariance matrix. In this case $Q = E[Z Z^T]$. Since $Z = C(S + N)$ plus output noise, where C are the weights, S is the signal, and N is the input noise, $Q = C q C^T + r$, where q is the covariance matrix of the input (signal plus input noise; input noise alone when S is held fixed, as in $H(Z|S)$) and r is the covariance matrix of the output noise.

(Somewhat confusing:) $\partial H / \partial C = Q^{-1} C q$, because the derivative of the determinant is complicated; check the appendix for the derivation. It uses $\ln \det Q = Tr \ln Q$ and $dH = \frac{1}{2} d(Tr \ln Q) = \frac{1}{2} Tr(Q^{-1} dQ)$, which holds for symmetric positive-definite matrices like Q (Q is invertible as long as r is positive definite).

From this he comes up with a set of rules whereby feedforward weights are trained in a Hebbian fashion, but based on activity after lateral activation. The lateral activation has a weight matrix $F = I - \alpha Q$ (again, Q is the covariance matrix of Z). If $y(0) = Y$ and $y(t+1) = Y + F y(t)$, where Y is the feed-forward activation, then $\alpha y(\infty) = Q^{-1} Y$. This checks out (MATLAB/Octave):

    x = randn(1000, 10);
    Q = x' * x;                      % 10x10 (unnormalized) covariance
    a = 0.001;                       % alpha
    Y = randn(10, 1);                % feed-forward activation
    y = zeros(10, 1);
    for i = 1:1000
        y = Y + (eye(10) - a*Q)*y;   % lateral recursion, F = I - a*Q
    end
    y - pinv(Q)*Y / a                % should be (numerically) zero

This recursive definition is from Jacobi: $\alpha y(\infty) = \alpha \sum_{t=0}^{\infty} F^t Y = \alpha (I - F)^{-1} Y = Q^{-1} Y$.
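A quick note on when the recursion above converges (my own aside, not from the paper): the sum $\sum_t F^t Y$ only converges if every eigenvalue of $F = I - \alpha Q$ has magnitude below 1. Writing $\lambda_i$ for the eigenvalues of Q (my notation), that is

$|1 - \alpha \lambda_i| < 1 \iff 0 < \alpha \lambda_i < 2$ for all $i$, i.e. $0 < \alpha < 2 / \lambda_{max}(Q)$ with Q positive definite.

In the check above, $Q = x^T x$ with x a 1000 x 10 standard-normal matrix, so the $\lambda_i$ should sit roughly around 1000 (somewhere near 800 to 1200). Then $\alpha = 0.001$ gives $\alpha \lambda_i \approx 1$, so $|1 - \alpha \lambda_i|$ is around 0.2 or less and 1000 iterations are far more than needed. The same loop would diverge once $\alpha \lambda_{max} > 2$.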
Still, you need to estimate Q through a running average, $\Delta Q_{nm} = \frac{1}{M}(Y_n Y_m + r_{nm} - Q_{nm})$, and since $F = I - \alpha Q$, F is formed via anti-Hebbian terms. To this is added a 'sensing' learning phase and a 'noise' unlearning phase -- one maximizes $H(Z)$, the other minimizes $H(Z|S)$.

Everything is then applied, similar to before, to a Gaussian-filtered one-dimensional white-noise stimulus. He shows this results in bandpass filter behavior -- quite weak sauce in an era where ML papers are expected to test on five or so datasets. Even if this was 1992 (nearly thirty years ago!), it would have been nice to see this applied to a more realistic dataset; perhaps some of the following papers? Olshausen & Field came out in 1996 -- but they applied their algorithm to real images. In both Olshausen & Field and this work, no affordances are made for multiple layers. There have to be solutions out there...

{1545} ref: -1988 tags: Linsker infomax linear neural network hebbian learning unsupervised date: 08-03-2021 06:12 gmt revision:2

Ralph Linsker, 1988. One of the first (verbose, slightly diffuse) investigations of whether linear projection neurons (i.e. dot-product; no non-linearity) can express useful tuning functions. 'Useful' here means information-preserving, in the face of noise or dimensional bottlenecks (like PCA). Starts with Hebbian learning functions, and shows that with these + white-noise sensory input + some local topology, you can get simple and complex visual cell responses. Ralph notes that neurons in primate visual cortex are tuned in utero -- prior to real-world visual experience! Wow. (Who did these studies?) This is a very minimalistic starting point; there aren't even structured stimuli (!) The single-neuron (and later, multiple-neuron) models are purely feed-forward; the author cautions that a lack of feedback is not biologically realistic. Also note that this was back in the Motorola 680x0 days ... 
computers were not that powerful (but certainly could handle more than 1-2 neurons!)

Linear algebra shows that Hebbian synapses cause a linear layer to learn the covariance function of its inputs, $Q$, with no dependence on the actual layer activity. When looked at in terms of an energy function, this is equivalent to gradient descent on an energy that is minimized when the layer-output variance is maximized.

He also hits on:
- Hopfield networks, PCA, Oja's constrained Hebbian rule $\Delta w_i \propto \langle L_2 (L_1 - L_2 w_i) \rangle$ (that is, a quadratic constraint on the weights to make $\sum w^2 \sim 1$).
- Optimal linear reconstruction in the presence of noise.
- Mutual information between layer input and output (I found this to be a bit hand-wavey). Yet he notes critically: "but it is not true that maximum information rate and maximum activity variance coincide when the probability distribution of signals is arbitrary". Indeed. The world is characterized by very non-Gaussian structured sensory stimuli.
- Redundancy and diversity in a 2-neuron coding model.
- Role of infomax in maximizing the determinant of the weight matrix, sorta.

One may critically challenge the infomax idea: we very much need to (and do) throw away spurious or irrelevant information in our sensory streams; what upper layers 'care about' when making decisions is certainly relevant to the lower layers. This credit-assignment is neatly solved by backprop, and there are a number of 'biologically plausible' means of performing it, but both this and infomax are maybe avoiding the problem. What might the upper layers really care about? Likely 'care about' is an emergent property of the interacting local learning rules and network structure. Can you search directly in these domains, within biological limits, and motivated by statistical reality, to find unsupervised-learning networks? You'll still need a way to rank the networks, hence an objective 'care about' function. Sigh. Either way, I don't per se put a lot of weight in the infomax principle. 
It could be useful, but is only part of the story. Otherwise Linsker's discussion is accessible, lucid, and prescient. Lol.

{1454} ref: -2011 tags: Andrew Ng high level unsupervised autoencoders date: 03-15-2019 06:09 gmt revision:7

Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng

Input data: 10M random 200x200 frames from YouTube; each video contributes only one frame. Used local receptive fields to reduce the communication requirements. Trained on 1000 computers, 16 cores each, for 3 days. "Strongly influenced by" Olshausen & Field {1448} -- but that is limited to a shallow architecture. Lee et al 2008 show that stacked RBMs can model simple functions of the cortex. Lee et al 2009 show that a convolutional DBN trained on faces can learn a face detector.

Their architecture: sparse deep autoencoder with
- Local receptive fields: each feature of the autoencoder can connect to only a small region of the lower layer (i.e. non-convolutional). Purely linear layer. More biologically plausible & allows the learning of invariances other than translational invariance (Le et al 2010). No weight sharing means the network is extra large == 1 billion weights. Still, the human visual cortex is about a million times larger in neurons and synapses.
- L2 pooling (Hyvarinen et al 2009), which allows the learning of invariant features. I.e. each pooling unit outputs the square root of the sum of the squares of its inputs. Square root nonlinearity.
- Local contrast normalization -- subtractive and divisive (Jarrett et al 2009).

Encoding weights $W_1$ and decoding weights $W_2$ are adjusted to minimize the reconstruction error, penalized by 0.1 * the sparse pooling layer activation. Latter term encourages the network to find invariances. 
$\min_{W_1, W_2} \sum_{i=1}^m \left( \| W_2 W_1^T x^{(i)} - x^{(i)} \|_2^2 + \lambda \sum_{j=1}^k \sqrt{\epsilon + H_j (W_1^T x^{(i)})^2} \right)$

$H_j$ are the weights to the j-th pooling element, $\lambda = 0.1$; m examples; k pooling units. This is also known as reconstruction Topographic Independent Component Analysis. Weights are updated through asynchronous SGD, with a minibatch size of 100. Note that deeper autoencoders don't fare consistently better.
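To unpack the penalty term in the objective (a sketch of my reading of the formula, not a statement from the paper): assuming the square in $(W_1^T x^{(i)})^2$ is taken element-wise, and writing $H_{jl}$ (my notation) for the l-th entry of the pooling weights $H_j$, the j-th pooling unit is

$p_j(x) = \sqrt{\epsilon + \sum_l H_{jl} \, (W_1^T x)_l^2}$

If the $H_{jl}$ are 1 inside a local neighborhood and 0 elsewhere (an assumption), and $\epsilon \to 0$, this is the square root of the sum of squares of the pooled inputs, i.e. the L2 pooling described above. The penalty $\lambda \sum_j p_j(x^{(i)})$ is then an L1 norm on the non-negative pooling outputs, which is presumably why it acts as the sparsity term.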