SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
 Modeled directly on {1505} "Scalable and Sustainable Deep Learning via Randomized Hashing".
 Much of the paper's emphasis is on performance and tuning rather than on theoretical or computational advances.
 Experiments are on explicitly wide and sparse classification datasets.
 Free parameters:
 L: number of hash tables per layer.
 K: size of the hash code, in bits, per table.
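 To make the K/L interaction concrete, here is a minimal sketch of an L-table, K-bit Simhash index for selecting active neurons. This is illustrative only, not the paper's implementation; class and method names are my own.

```python
import random
from collections import defaultdict

class SimhashTables:
    """Toy L-table, K-bit Simhash index over neuron weight vectors (a sketch,
    not SLIDE's actual data structure)."""

    def __init__(self, L, K, dim, seed=0):
        rng = random.Random(seed)
        # L independent tables, each defined by K random hyperplanes.
        self.planes = [[[rng.choice((-1, 1)) for _ in range(dim)]
                        for _ in range(K)] for _ in range(L)]
        self.buckets = [defaultdict(set) for _ in range(L)]

    def _code(self, planes, x):
        # K-bit code: one sign bit per random hyperplane.
        bits = 0
        for p in planes:
            dot = sum(pi * xi for pi, xi in zip(p, x))
            bits = (bits << 1) | (dot >= 0)
        return bits

    def insert(self, neuron_id, weight_vec):
        for planes, bucket in zip(self.planes, self.buckets):
            bucket[self._code(planes, weight_vec)].add(neuron_id)

    def query(self, activation_vec):
        # Candidate active neurons: union of matching buckets over the L tables.
        hits = set()
        for planes, bucket in zip(self.planes, self.buckets):
            hits |= bucket.get(self._code(planes, activation_vec), set())
        return hits
```

 Larger K makes each bucket more selective (fewer collisions); larger L raises recall by giving a vector L chances to collide with its near neighbors.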
 Definitely much faster, but why can't you port the sparse LSH algorithms to a GPU and increase speed further?

 Architecture follows "Loss Decomposition for Fast Learning in Large Output Spaces".
 Fully connected neural network with one hidden layer and a batch size of 128 for Delicious and 256 for Amazon670k.
 I don't think they performed the decomposition of the weight matrix, which requires the message-passing iteration to set the backprop loss-function weights. No mention of such message-passing. (??)
 Two hash functions: Simhash and densified winner-take-all (DWTA). DWTA is motivated by the observation that plain winner-take-all hashing is uninformative on very sparse inputs (most sampled coordinates are zero), so it densifies empty bins by borrowing from non-empty ones.
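 A toy version of the densification step, under my own simplified bin layout (the real DWTA uses a different borrowing rule based on repeated hashing):

```python
import random

def dwta_hash(x, num_bins_K, bin_size, seed=0):
    """Toy densified winner-take-all hash (a sketch, not the paper's version).

    Plain WTA: randomly permute coordinates, split them into K bins, record
    the argmax position within each bin. On very sparse x most bins are
    all-zero, so their argmax is meaningless. Densification lets an empty
    bin copy the answer from the next non-empty bin.
    """
    rng = random.Random(seed)
    perm = list(range(len(x)))
    rng.shuffle(perm)
    raw = []
    for b in range(num_bins_K):
        chunk = perm[b * bin_size:(b + 1) * bin_size]
        vals = [x[i] for i in chunk]
        # None marks an empty (all-zero) bin.
        raw.append(max(range(len(vals)), key=vals.__getitem__)
                   if any(vals) else None)
    # Densify: empty bins borrow from the next non-empty bin
    # (assumes at least one non-empty bin, i.e. x is not all-zero).
    code = []
    for b in range(num_bins_K):
        j = b
        while raw[j] is None:
            j = (j + 1) % num_bins_K
        code.append(raw[j])
    return tuple(code)
```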
 Delicious200k used Simhash, K = 9, L = 50;
 Amazon670k used DWTA hash, K = 8, L = 50.
 "It should be noted that if we compute activation for s << 1 fraction of neurons in each layer on average, the fraction of weights that needs to be updated is only $s^2$ ." (Since the only weights that are updated are the intersection of active pre and post.)
 Updates are performed in a HOGWILD manner, where some overlap in weight updates (which are all computed in parallel) is tolerable for convergence.
 Updates to the hash tables, however, are not performed every SGD iteration. Instead, they are scheduled with an exponential decay term, i.e. the time between updates grows as the network converges. This is because weight changes early in training are larger than those near convergence, so the hash tables go stale faster at first. The initial hash update interval is every 50 gradient updates.
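 A sketch of such a schedule. The initial gap of 50 comes from the note above; the growth factor is an assumed illustrative value, not from the paper:

```python
def hash_update_steps(initial_gap=50, growth=1.05, total_steps=100_000):
    """Yield the SGD steps at which hash tables are rebuilt (a sketch).

    The gap between rebuilds grows geometrically: frequent rehashing early
    in training when weights move a lot, then increasingly rare rebuilds
    as the network converges and the tables stay valid longer.
    """
    step, gap = 0, float(initial_gap)
    while step < total_steps:
        step += int(gap)
        gap *= growth
        yield step
```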
 For Simhash, which uses inner products with random vectors over {+1, 0, -1} (so that you need only additions and subtractions, no multiplies), savings can be extended further by recomputing only the hashes affected by changed weights. As noted above, the high level of unit sparsity makes these weight changes quadratically sparse.
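 The incremental idea above can be sketched as follows, assuming each table caches its running dot products; function names and the sparse-dict plane representation are my own:

```python
def update_dots(dots, planes, changed):
    """Incrementally refresh cached Simhash dot products after a sparse
    weight update (a sketch of the idea, not SLIDE's code).

    dots[k]   : cached signed sum for random vector k.
    planes[k] : ternary random vector as a dict {index: +1 or -1}
                (zero entries omitted).
    changed   : {weight_index: delta} for the few weights that moved.
    """
    for k, plane in enumerate(planes):
        for idx, delta in changed.items():
            sign = plane.get(idx, 0)
            if sign > 0:
                dots[k] += delta   # addition only, no multiply
            elif sign < 0:
                dots[k] -= delta   # subtraction only, no multiply
    return dots

def code_from_dots(dots):
    # K-bit Simhash code: the sign bit of each cached dot product.
    bits = 0
    for d in dots:
        bits = (bits << 1) | (d >= 0)
    return bits
```

 Because updated weights are quadratically sparse, `changed` is tiny and each rehash touches only a handful of entries instead of the full weight vector.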
 They test their hashing-optimized deep learning algorithm on Delicious200k and Amazon670k, both extreme-classification tasks with a very wide output layer. The authors suggest that most of the computational expense is in this last layer, as in "Loss Decomposition for Fast Learning in Large Output Spaces".
