Pay attention to MLPs
 Using bilinear / multiplicative gating plus deep / wide networks, you can attain accuracies similar to Transformers on vision and masked language modeling tasks! No attention needed, just an in-network multiplicative term.
 And the math is quite straightforward. Per layer:
 $Z = \sigma(X U), \quad \hat{Z} = s(Z), \quad Y = \hat{Z} V$
 Where $X$ is the layer input, $\sigma$ is the nonlinearity (GELU), $U$ is a weight matrix, $\hat{Z}$ is the spatially-gated $Z$, and $V$ is another weight matrix.
 $s(Z) = Z_1 \odot (W Z_2 + b)$
 Where $Z$ is split into two parts, $Z_1, Z_2$, along the channel dimension, $\odot$ is element-wise multiplication, and $W$ is a weight matrix (acting along the spatial dimension).
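 A minimal NumPy sketch of one such layer, straight from the equations above (the near-zero init of $W$ and ones init of $b$ follow the paper; shapes and variable names are mine):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gmlp_layer(X, U, V, W, b):
    """One gMLP layer per the equations above.
    X: (n, d)   n spatial positions (tokens / patches), d channels
    U: (d, d_ff), V: (d_ff // 2, d)   channel projections
    W: (n, n), b: (n, 1)              spatial projection inside the gating unit
    """
    Z = gelu(X @ U)                   # Z = sigma(X U)
    Z1, Z2 = np.split(Z, 2, axis=-1)  # split along the channel dimension
    Z_hat = Z1 * (W @ Z2 + b)         # s(Z) = Z1 ⊙ (W Z2 + b): gating across positions
    return Z_hat @ V                  # Y = Z_hat V

n, d, d_ff = 16, 8, 32
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
U = rng.normal(size=(d, d_ff)) * 0.02
V = rng.normal(size=(d_ff // 2, d)) * 0.02
W = np.zeros((n, n))   # paper initializes W near zero ...
b = np.ones((n, 1))    # ... and b to one, so s(Z) ≈ Z1 at the start
print(gmlp_layer(X, U, V, W, b).shape)   # (16, 8)
```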

 You of course need a lot of compute; the paper has nice figures of how accuracy scales with depth / number of parameters / model size. I guess you can do this if you're Google.
Pretty remarkable that an industrial lab freely publishes results like this. I guess the ROI is that they get the resultant improved ideas? Or, perhaps, Google is in such a dominant position in terms of data and compute that even if they give away ideas and code, provided some of the resultant innovation returns to them, they win. The return includes trained people as well as ideas. Good for us, I guess! 
PMID: 29205151 Towards deep learning with segregated dendrites
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5716677/
 Much emphasis on the problem of credit assignment in biological neural networks.
 That is: given complex behavior, how do upstream neurons change to improve the task of downstream neurons?
 Or: given downstream neurons, how do upstream neurons receive 'credit' for informing behavior?
 I find this a very limiting framework, and it is one of my chief beefs with the work.
 Spatiotemporal Bayesian structure seems like a much better axis (axes) to cast function against.
 Or, it could be segregation into 'signal' and 'error' or 'figure/ground' based on hierarchical spatiotemporal statistical properties that matters ...
 ... with proper integration of non-stochastic spike timing + neo-STDP.
 This still requires some solution of the credit-assignment problem, I know, I know.
 They outline a spiking neuron model with zero, one, or two hidden layers, and segregated apical (feedback) and basal (feedforward) dendrites, as in a layer 5 pyramidal neuron.
 The apical dendrites have plateau potentials, which are stimulated through (random) feedback weights from the output neurons.
 Output neurons are forced to one-hot activation at maximum firing rate during training.

 From the paper's figure caption: "In order to assign credit, feedforward information must be integrated separately from any feedback signals used to calculate error for synaptic updates (the error is indicated here with δ). (B) Illustration of the segregated dendrites proposal. Rather than using a separate pathway to calculate error based on feedback, segregated dendritic compartments could receive feedback and calculate the error signals locally."
 Uses the MNIST database, naturally.
 Poisson spiking input neurons, 784 of them, again natch.
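 A rate-based toy rendering of the setup so far, with sigmoid rates standing in for the actual spiking dynamics; layer sizes, the 0.1 scalings, and all names are my assumptions, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

n_in, n_hid, n_out = 784, 500, 10
W0 = rng.normal(scale=0.1, size=(n_hid, n_in))   # basal (feedforward) weights
W1 = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output weights
Y  = rng.normal(scale=0.1, size=(n_hid, n_out))  # fixed random feedback to apical dendrites

# Poisson spike encoding of one (stand-in) MNIST image over T time steps
T = 100
pixels = rng.random(n_in)                        # normalized pixel intensities
spikes = rng.random((T, n_in)) < 0.1 * pixels    # rate-coded Poisson spikes

basal  = sigmoid(W0 @ spikes.mean(axis=0))  # basal potential from feedforward drive
out    = sigmoid(W1 @ basal)                # output firing rates
target = np.eye(n_out)[3]                   # teaching signal: one-hot forcing of the outputs
```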
 They derive local loss function learning rules to make the plateau potential (from the feedback weights) match the feedforward potential; see the sketch below.
 This encourages the hidden layer → output layer weights to approximate the inverse of the random feedback weight network, which it does! (At least, the Jacobians are inverses of each other.)
 The matching is performed in two phases, feedforward and feedback. This itself is not biologically implausible, just unlikely.
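 Continuing the toy sketch above, a hedged reading of the two-phase plateau matching (the paper averages plateau potentials over phase windows; this is just the gist):

```python
# Phase 1 (forward): apical plateau driven by the network's own output
plateau_fwd = sigmoid(Y @ out)

# Phase 2 (target): output clamped to the one-hot teaching signal
plateau_tgt = sigmoid(Y @ target)

# Local hidden-layer update: move the feedforward (basal) response so the
# forward-phase plateau approaches the target-phase plateau; credit arrives
# through the fixed random feedback weights Y
delta_hid = (plateau_tgt - plateau_fwd) * basal * (1.0 - basal)
W0 += 0.1 * np.outer(delta_hid, spikes.mean(axis=0))

# Output weights follow an ordinary delta rule toward the teaching signal
delta_out = (target - out) * out * (1.0 - out)
W1 += 0.1 * np.outer(delta_out, basal)
```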
 Achieved moderate performance on MNIST, ~4% test error, which improved with two hidden layers.
 Very good, interesting scholarship on the relevant latest findings ''in vivo''.
 While the model seems workable though ad-hoc or just-so, the scholarship points to something better: the use of multiple neuron subtypes to accomplish different elements (variables) of the random-feedback credit assignment algorithm.
 These small models can be tuned to do this somewhat simple task through enough fiddling & manual backpropagation of errors (in the algorithmic space, not weight space).
 They suggest that the early phases of learning may entail learning the feedback weights. Fascinating.
 ''Things are definitely moving forward''.
