Scalable and sustainable deep learning via randomized hashing
 Central idea: replace dropout, adaptive dropout, or winner-take-all with a fast (sublinear-time) hash-based selection of active nodes, framed as approximate MIPS (maximum inner product search) using asymmetric locality-sensitive hashing.
 This avoids much of the expensive inner-product multiply-accumulate work & energy spent on nodes that will either be completely off due to the ReLU or other nonlinearity, or just unimportant for the algorithm + current input.
 The results show that you don't need very many neurons active in a given layer for successful training.
 Cf. adaptive dropout, which chooses nodes based on their activations: a few nodes are sampled from the network probabilistically, with probabilities depending on each node's activation for the current input.
 Adaptive dropout demonstrates better performance than vanilla dropout [44].
 Significantly more nodes can be dropped adaptively than with vanilla dropout while retaining superior performance.
 WTA is an extreme form of adaptive dropout that uses mini-batch statistics to enforce a sparsity constraint. [28] {1507} Winner-take-all autoencoders
 Their approach uses the insight that selecting a very sparse set of hidden nodes with the highest activations can be reformulated as dynamic approximate query processing, solvable with LSH.
 LSH lookup can be sublinear time; normal processing requires computing the inner product with every node.
 LSH maps similar vectors into the same bucket with high probability. That is, it maps vectors to integers (bucket numbers).
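 A minimal sketch of that bucket mapping, in the sign-random-projection (SimHash) style; the fingerprint width and dimensions here are illustrative, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 6, 16                          # bits per fingerprint, input dimension
planes = rng.standard_normal((K, d))  # K random hyperplanes

def bucket(v):
    # Each bit is the sign of one random projection; the K bits,
    # packed into an integer, name the bucket.
    bits = (planes @ v) >= 0
    return sum(int(b) << i for i, b in enumerate(bits))
```

 Vectors pointing in similar directions agree on most projection signs, so they land in the same bucket with high probability.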
 Similar approach: HashedNets [6], which aimed to decrease the number of parameters in a network by using a universal random hash function to tie weights. Compressing neural networks with the hashing trick
 "HashedNets uses a lowcost hash function to randomly group connection weights into hash buckets, and all connections within the same hash bucket share a single parameter value."
 Ref [38] shows how asymmetric hash functions allow LSH to be converted into a sublinear-time algorithm for maximum inner product search (MIPS).
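 As I understand the construction in [38], the trick is a pair of transforms, P for the stored data and Q for queries, that turn MIPS into ordinary nearest-neighbor search, after which any Euclidean LSH applies. A sketch (m and the scaling are parameters of the construction; data must be pre-scaled to norm < 1):

```python
import numpy as np

m = 3  # number of appended coordinates (a parameter of the construction)

def P(x):
    # Data transform: append ||x||^2, ||x||^4, ..., ||x||^(2^m).
    n2 = np.dot(x, x)  # assumes data pre-scaled so ||x|| < 1
    return np.concatenate([x, [n2 ** (2 ** i) for i in range(m)]])

def Q(q):
    # Query transform: append the constant 1/2, m times.
    return np.concatenate([q, [0.5] * m])

# ||P(x) - Q(q)||^2 = ||q||^2 + m/4 - 2 q.x + ||x||^(2^(m+1)); the last
# term vanishes as m grows, so for a fixed query, minimizing this distance
# maximizes the inner product q.x.
```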
 Used multi-probe LSH: rather than keeping a large number of hash tables (L), which increases hashing time and memory use, they probe nearby buckets within each table. That is, they probe the bucket B_j(Q) and the buckets for slightly perturbed queries Q. See ref [26].
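 A simplified sketch of the probing order (real multi-probe LSH [26] ranks perturbations by their estimated success probability; here I just flip one fingerprint bit at a time):

```python
def probe_sequence(bucket_id, k_bits):
    # The exact bucket first, then every bucket whose fingerprint
    # differs by one flipped bit (a slightly perturbed query).
    yield bucket_id
    for i in range(k_bits):
        yield bucket_id ^ (1 << i)

# list(probe_sequence(0b101, 3)) -> [5, 4, 7, 1]
```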
 See reference [2] for theory...
 Following ref [42], use K randomized hash functions to generate a K-bit fingerprint per vector. Each bit is the sign of an asymmetric random projection. Buckets contain pointers to the nodes (neurons); only active buckets are kept around.
 The K hash functions serve to increase the precision of the fingerprint: retrieved nodes are more likely to be active.
 Have L hash tables for each hidden layer; these increase the probability of finding useful / active nodes despite the randomness of any single hash function.
 The hashing is asymmetric in the sense that the query and the collection data are transformed differently before hashing.
 In every layer during SGD, compute K x L hashes of the input, probe about 10 L buckets, and take their union. Experiments used K = 6 and L = 5.
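 Putting the per-layer scheme together, a simplified end-to-end sketch: plain sign-random-projection hashes stand in for the asymmetric MIPS hashes, the layer sizes are invented, and there is no multi-probing, just the exact bucket from each of the L tables:

```python
import numpy as np

rng = np.random.default_rng(0)
K, L = 6, 5            # bits per table, number of tables (as in the experiments)
d, n_nodes = 64, 1000  # invented input dimension and layer width

weights = rng.standard_normal((n_nodes, d))  # hidden-layer weight vectors
planes = rng.standard_normal((L, K, d))      # K random hyperplanes per table

def fingerprint(t, v):
    # K-bit fingerprint of v in table t: signs of K random projections.
    bits = (planes[t] @ v) >= 0
    return sum(int(b) << i for i, b in enumerate(bits))

# Build: hash every node's weight vector into each of the L tables.
tables = [{} for _ in range(L)]
for t in range(L):
    for node in range(n_nodes):
        tables[t].setdefault(fingerprint(t, weights[node]), set()).add(node)

def active_nodes(x):
    # Query: hash the input once per table, take the union of the L buckets.
    found = set()
    for t in range(L):
        found |= tables[t].get(fingerprint(t, x), set())
    return found

x = rng.standard_normal(d)
cand = active_nodes(x)  # typically a small fraction of the 1000 nodes
```

 Only the candidate set then needs real inner products, which is where the claimed compute savings come from.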
 See ref [30], where the authors show around a 500x reduction in computation for image search via different algorithmic and systems choices. Capsule: a camera-based positioning system using learning {1506}
 Uses relatively small test datasets (MNIST-8M, NORB, Convex, Rectangles), each resized to have smallish input vectors.


 Really want more analysis of what exactly is going on here: what happens when you change the hash function, for example? How much does training depend on a suitable ROC or precision/recall of the activation selection?
 For example, they could have calculated the actual activations & the WTA selection, and compared them to the hash-based selection; how correlated are they?
