Local synaptic learning rules suffice to maximize mutual information in a linear network
 Ralph Linsker, 1992.
 A development upon {1545}, this time with lateral inhibition trained through noise contrast and anti-Hebbian plasticity.
 {1545} does not perfectly maximize the mutual information between the input and output; doing so allegedly requires the inverse of the covariance matrix, $Q$.
 As before, infomax principles: maximize the mutual information $MI = H(Z) - H(Z|S)$, where $Z$ is the network output and $S$ is the signal input. (Note: the second term means minimizing the conditional entropy of the output given the input.)
 For a Gaussian variable, $H = \frac{1}{2} \ln \det Q$ (plus an additive constant), where $Q$ is the covariance matrix. In this case $Q = E[Z Z^T]$.
 Since $Z = C(S, N)$, where $C$ are the weights, $S$ is the signal, and $N$ is the noise, $Q = C q C^T + r$, where $q$ is the covariance matrix of the input and $r$ is the covariance matrix of the output noise.
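To make the covariance identity concrete, here is a quick numerical sketch (mine, not the paper's; Python/NumPy with made-up dimensions): draw inputs with covariance $q$ and output noise with covariance $r$, and check that the empirical output covariance matches $C q C^T + r$, from which $H$ follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, T = 6, 4, 200_000

C = rng.standard_normal((n_out, n_in))            # weights (arbitrary)
A = rng.standard_normal((n_in, n_in))
q = A @ A.T / n_in                                # input covariance (SPD by construction)
r = 0.1 * np.eye(n_out)                           # output-noise covariance

X = rng.multivariate_normal(np.zeros(n_in), q, size=T)    # inputs with covariance q
N = rng.multivariate_normal(np.zeros(n_out), r, size=T)   # output noise with covariance r
Z = X @ C.T + N                                   # linear network output

Q_emp = Z.T @ Z / T                               # empirical E[Z Z^T]
Q_th = C @ q @ C.T + r                            # predicted covariance
# Q_emp and Q_th agree up to sampling error.

# Gaussian entropy, up to the additive constant (n_out/2) * ln(2*pi*e):
H = 0.5 * np.linalg.slogdet(Q_th)[1]
```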
 (somewhat confusing): $\partial H / \partial C = Q^{-1} C q$
 ...this form is used because differentiating the determinant directly is complicated.
 Check the appendix for the derivation: $\ln \det Q = \mathrm{Tr} \ln Q$, and $dH = \frac{1}{2} d(\mathrm{Tr} \ln Q) = \frac{1}{2} \mathrm{Tr}(Q^{-1} dQ)$; this holds for positive-definite matrices like $Q$.
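The gradient identity is easy to verify by finite differences. A minimal sketch (my code, not the paper's, assuming $H(C) = \frac{1}{2} \ln \det(C q C^T + r)$ with arbitrary small dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 4, 6
C = rng.standard_normal((n_out, n_in))            # weights
A = rng.standard_normal((n_in, n_in))
q = A @ A.T / n_in                                # input covariance (SPD)
r = 0.1 * np.eye(n_out)                           # output-noise covariance

def H(C):
    """Output entropy up to an additive constant: (1/2) ln det (C q C^T + r)."""
    return 0.5 * np.linalg.slogdet(C @ q @ C.T + r)[1]

# Analytic gradient claimed in the paper: dH/dC = Q^{-1} C q
Q = C @ q @ C.T + r
analytic = np.linalg.inv(Q) @ C @ q

# Central finite differences, entry by entry
eps = 1e-6
numeric = np.zeros_like(C)
for i in range(n_out):
    for j in range(n_in):
        E = np.zeros_like(C)
        E[i, j] = eps
        numeric[i, j] = (H(C + E) - H(C - E)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```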
 From this he derives a set of rules whereby the feedforward weights are trained in a Hebbian fashion, but based on activity after lateral activation.
 The lateral activation has weight matrix $F = I - \alpha Q$ (again, $Q$ is the covariance matrix of $Z$). If $y(0) = Y$ and $y(t+1) = Y + F y(t)$, where $Y$ is the feedforward activation, then $\alpha y(\infty) = Q^{-1} Y$. This checks out:
x = randn(1000, 10);
Q = x' * x;                 % (unnormalized) covariance of the input
a = 0.001;                  % alpha; needs a * max(eig(Q)) < 2 to converge
Y = randn(10, 1);           % feedforward activation
y = zeros(10, 1);
for i = 1:1000
    y = Y + (eye(10) - a*Q) * y;
end
y - pinv(Q)*Y / a           % should be (near) zero
 This recursive definition is the Jacobi iteration: $\alpha y(\infty) = \alpha \sum_{t=0}^{\infty} F^t Y = \alpha (I - F)^{-1} Y = Q^{-1} Y$.
 Still, you need to estimate $Q$ through a running average, $\Delta Q_{nm} = \frac{1}{M} (Y_n Y_m + r_{nm} - Q_{nm})$, and since $F = I - \alpha Q$, $F$ is formed via anti-Hebbian terms.
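A sketch of how that running average behaves (my Python translation, with made-up dimensions, window $M$, and rate $\alpha$): $Q$ tracks $E[Y Y^T] + r$ online, and the lateral weights then follow as $F = I - \alpha Q$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, alpha = 5, 500, 0.001                  # size, averaging window, rate (all made up)
A = rng.standard_normal((n, n))
cov_Y = A @ A.T / n                          # true covariance of feedforward activity
L = np.linalg.cholesky(cov_Y)
r = 0.05 * np.eye(n)                         # output-noise covariance term

Q = np.eye(n)                                # running estimate
for _ in range(50_000):
    Y = L @ rng.standard_normal(n)           # one sample of feedforward activation
    Q += (np.outer(Y, Y) + r - Q) / M        # Delta Q_nm = (Y_n Y_m + r_nm - Q_nm) / M

F = np.eye(n) - alpha * Q                    # anti-Hebbian lateral weights
# Q has converged to cov_Y + r, up to sampling noise.
```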
To this is added a 'sensing' learning phase and a 'noise' unlearning phase: one maximizes $H(Z)$, the other minimizes $H(Z|S)$. Everything is then applied, as before, to Gaussian-filtered one-dimensional white-noise stimuli. He shows this results in bandpass filter behavior: quite weak sauce in an era where ML papers are expected to test on five or so datasets. Even allowing that this was 1992 (thirty years ago!), it would have been nice to see the method applied to a more realistic dataset; perhaps some of the follow-up papers did? Olshausen & Field came out in 1996, and they applied their algorithm to real images.
In both Olshausen & Field's work and this one, no affordances are made for multiple layers. There have to be solutions out there...
