
SLIDE: in defense of smart algorithms over hardware acceleration for large-scale deep learning systems

  • Modeled directly on {1505} - Scalable and sustainable deep learning via randomized hashing.
  • Much of the emphasis in the paper is on performance and tuning rather than on theoretical or computational advances.
    • This is specifically with very wide, sparse classification datasets.
  • Free parameters:
    • L -- number of hash tables per layer.
    • K -- size of hash code, in bits, per table.
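
To make the role of K and L concrete, here is a minimal sketch (hypothetical Python, not the authors' C++/OpenMP code) of a per-layer LSH index: each layer keeps L tables, each table buckets neurons by a K-bit Simhash code of their weight vectors, and the set of neurons to activate for an input is the union of the L buckets the input hashes into. Class and method names are made up for illustration.

```python
import numpy as np

class LSHTables:
    """L hash tables for one layer; each keys neurons by a K-bit Simhash code."""
    def __init__(self, n_features, K=9, L=50, seed=0):
        rng = np.random.default_rng(seed)
        self.K, self.L = K, L
        self.planes = rng.standard_normal((L, K, n_features))  # random hyperplanes
        self.tables = [dict() for _ in range(L)]

    def _codes(self, v):
        # One K-bit signed-random-projection code per table.
        bits = (np.einsum('lkd,d->lk', self.planes, v) > 0).astype(np.uint32)
        return (bits << np.arange(self.K, dtype=np.uint32)).sum(axis=1)

    def insert(self, neuron_id, weight_vec):
        # Index a neuron by hashing its weight vector into all L tables.
        for table, code in zip(self.tables, self._codes(weight_vec)):
            table.setdefault(int(code), set()).add(neuron_id)

    def query(self, x):
        # Candidate active neurons = union of the L buckets the input falls into.
        active = set()
        for table, code in zip(self.tables, self._codes(x)):
            active |= table.get(int(code), set())
        return active
```

Larger K makes each bucket more selective (fewer collisions, sparser activations); larger L compensates by taking the union over more tables, trading memory and lookup time for recall.
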
  • Definitely much faster -- but why can't you port the sparse LSH algorithms to a GPU & increase speed further?
  • Architecture follows Loss Decomposition for Fast Learning in Large Output Spaces
    • Fully connected neural network with one hidden layer and a batch size of 128 for Delicious and 256 for Amazon-670k.
    • I don't think they performed the decomposition of the weight matrix, which requires the message-passing iteration to set the backprop loss-function weights. No mention of such message-passing. (??)
  • Two hash functions: Simhash and densified winner-take-all (DWTA). DWTA is based on the observation that if the input data is very sparse, plain winner-take-all hashing does not hash well -- most bins see only zeros -- so the hash must be densified.
    • Delicious-200k used Simhash, K = 9, L = 50;
    • Amazon-670k used DWTA hash, K = 8, L = 50;
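
A rough sketch (hypothetical Python) of the winner-take-all idea and why it needs densification for very sparse inputs: each of the K hash values is the argmax position within a random bin of coordinates, and when the input is sparse most bins contain only zeros and carry no information. The probing loop below is a simplified stand-in for the paper's densification scheme; bin_size is an assumed parameter, and non-negative features are assumed.

```python
import numpy as np

def dwta_hash(x, K=8, bin_size=64, seed=0):
    """Densified winner-take-all hash: K argmax positions over random bins of
    coordinates; bins that are entirely zero (common for sparse x) borrow the
    value of the next non-empty bin (simplified densification).
    Assumes len(x) >= K * bin_size and non-negative feature values."""
    rng = np.random.default_rng(seed)
    bins = rng.permutation(len(x))[:K * bin_size].reshape(K, bin_size)
    vals = x[bins]
    codes = np.argmax(vals, axis=1)          # winner position within each bin
    nonempty = vals.max(axis=1) > 0
    for i in range(K):
        j = i
        while not nonempty[j]:               # all-zero bin: hash is meaningless
            j = (j + 1) % K                  # probe the next bin (simplified)
            if j == i:                       # x is entirely zero
                break
        codes[i] = codes[j]
    return codes
```
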
  • "It should be noted that if we compute activation for s << 1 fraction of neurons in each layer on average, the fraction of weights that needs to be updated is only s 2s^2 ." (Since the only weights that are updated are the intersection of active pre and post.)
  • Updates are performed in a HOGWILD manner, where some overlap in weight updates (which are all computed in parallel) is tolerable for convergence.
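
A minimal illustration of the HOGWILD idea (hypothetical Python threading on a toy least-squares problem, not the authors' OpenMP implementation): several workers apply SGD updates to one shared weight vector with no locks, accepting occasional overlapping writes. In SLIDE the updates are additionally very sparse (the s^2 fraction above), which keeps such collisions rare.

```python
import numpy as np
from threading import Thread

def worker(w, data, lr=0.01):
    # Unsynchronized SGD on shared w: gradient of 0.5*(w.x - y)^2 is (w.x - y)*x.
    for x, y in data:
        w -= lr * (w @ x - y) * x            # no lock; races are tolerated

rng = np.random.default_rng(0)
dim, n = 20, 4000
w_true = rng.standard_normal(dim)
X = rng.standard_normal((n, dim))
data = [(x, x @ w_true) for x in X]

w = np.zeros(dim)                            # shared parameter vector
threads = [Thread(target=worker, args=(w, data[i::4])) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(np.linalg.norm(w - w_true))            # small despite lock-free updates
```
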
  • Updates to the hash tables, however, are not computed every SGD iteration. Instead, they are scheduled with an exponential decay term -- e.g. the time between updates increases as the network converges. This is because the weight changes are large at the beginning of training and become smaller as the network converges. The initial hash update is every 50 gradient updates.
    • For Simhash, which uses inner products with random vectors of {+1, 0, -1} (so that you don't need multiplies, only additions and subtractions), the savings can be extended further by re-computing only the hash components affected by the changed weights. As noted above, the high level of unit sparsity makes these weight changes quadratically sparse.
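
A sketch of that scheduling idea (the initial period of 50 is from the paper; the growth rate here is an assumed placeholder): rebuild the hash tables after 50 gradient steps, then let the gap between rebuilds grow geometrically so that rehashing becomes rare as the weights settle.

```python
def rehash_steps(n_steps, initial_gap=50, growth=1.05):
    """Yield the SGD step indices at which the hash tables are rebuilt;
    the gap between rebuilds grows exponentially over training."""
    step, gap = 0, float(initial_gap)
    while True:
        step += int(round(gap))
        if step >= n_steps:
            return
        yield step
        gap *= growth

print(list(rehash_steps(100000))[:5])   # first few rebuild steps, gaps widening
```
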
  • They test their hashing-optimized deep learning algorithm on Delicious-200k and Amazon-670k, both extreme classification tasks with a very wide output layer. The authors suggest that most of the computational expense is in this last layer, the same as in Loss Decomposition for Fast Learning in Large Output Spaces.
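
Putting the pieces together for that wide final layer, a rough sketch (hypothetical Python, reusing the LSHTables sketch above): instead of computing all ~670k output logits, the hidden activation queries the layer's hash tables, and only the returned candidate neurons, plus the true labels during training, are evaluated.

```python
import numpy as np

def sparse_output_forward(hidden, W_out, b_out, tables, true_labels=()):
    """Evaluate only the LSH-selected output neurons.

    hidden      : (d_hidden,) activation of the single hidden layer
    W_out       : (n_out, d_hidden) output weights; n_out may be ~670k
    tables      : LSHTables built over the rows of W_out
    true_labels : label ids forced into the active set during training
    """
    active = sorted(tables.query(hidden) | set(true_labels))
    logits = W_out[active] @ hidden + b_out[active]   # |active| dot products, not n_out
    return np.asarray(active), logits
```

Backprop then touches only those rows of W_out, which is the source of the sparse updates the earlier bullets rely on.
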