{1544} ref: -2019 tags: HSIC information bottleneck deep learning backprop gaussian kernel date: 07-21-2021 16:28 gmt revision: 4

The HSIC Bottleneck: Deep Learning without Back-propagation

In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure.

The information bottleneck was proposed by Tishby, Pereira & Bialek (of 'Spikes') in 1999, and aims to increase the mutual information between the hidden representation and the labels while minimizing the mutual information between the hidden representation and the input:

$\min_{P(T_i | X)} I(X; T_i) - \beta I(T_i; Y)$

where $T_i$ is the hidden representation at layer $i$ (the layer output), $X$ is the layer input, and $Y$ are the labels.

By replacing $I(\cdot\,;\cdot)$ with the HSIC, and some derivation (?), they show that

$HSIC(D) = (m-1)^{-2} \operatorname{tr}(K_X H K_Y H)$

where $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ are the samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$ -- that is, the kernel function applied to all pairs of (vectoral) input variables -- and $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^T$ is the centering matrix. The kernel is simply a Gaussian, $k(x, y) = \exp(-\frac{1}{2}||x - y||^2 / \sigma^2)$.

So, if all the x and y are on average independent, the inner product will be mean zero, the kernel will be mean one, and after centering this leads to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and negative, I think). In practice they use three different widths for their kernel, and they also center the kernel matrices.

But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this... For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per-layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks -- albeit in a much less intelligible way.
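To make the estimator above concrete, here is a minimal numpy sketch of the biased HSIC estimate with a Gaussian kernel and the centering matrix. This is not the authors' code; the kernel widths, sample sizes, and variable names are my own assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    # Pairwise Gaussian kernel K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), shape (m, m)
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-0.5 * d2 / sigma**2)

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    # Biased HSIC estimator: (m-1)^{-2} tr(K_X H K_Y H)
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    Kx = gaussian_kernel(X, sigma_x)
    Ky = gaussian_kernel(Y, sigma_y)
    return np.trace(Kx @ H @ Ky @ H) / (m - 1)**2

# Toy check: dependent variables should give a larger HSIC than independent ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y_dep = X[:, :3] + 0.1 * rng.normal(size=(200, 3))   # a function of X plus noise
Y_ind = rng.normal(size=(200, 3))                    # independent of X
print(hsic(X, Y_dep), hsic(X, Y_ind))
```

A per-layer bottleneck objective in this spirit would then be something like hsic(T_i, X) - beta * hsic(T_i, Y) evaluated on a minibatch, minimized independently for each layer -- no gradients need to flow between layers.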