use https for features.
text: sort by
tags: modified
type: chronology
hide / / print
ref: -0 tags: asymmetric locality sensitive hash maximum inner product search sparsity date: 03-30-2020 02:17 gmt revision:5 [4] [3] [2] [1] [0] [head]

Improved asymmetric locality sensitive hashing for maximum inner product search

  • Like many other papers, this one is based on a long lineage of locality-sensitive hashing papers.
  • Key innovation, in [23] The power of asymmetry in binary hashing, was the development of asymmetric hashing -- the hash function of the query is different than the hash function used for storage. Roughly, this allows additional degrees of freedom since the similarity-function is (in the non-normalized case) non-symmetric.
    • For example, take query Q = [1 1] with keys A = [1 -1] and B = [3 3]. The nearest neighbor is A (distance 2), whereas the maximum inner product is B (inner product 6).
    • Alternately: self-inner product for Q and A is 2, whereas for B it's 18. Self-similarity is not the highest with inner products.
    • Norm of the query does not have an effect on the arg max of the search, though. Hence, for the paper assume that the query has been normalized for MIPS.
  • In this paper instead they convert MIPS into approximate cosine similarity search (which is like normalized MIPS), which can be efficiently solved with signed random projections.
  • (Established): LSH-L2 distance:
    • Sample a random vector a, iid normal N(0,1)
    • Sample a random normal b between 0 and r
      • r is the window size / radius (free parameters?)
    • Hash function is then the floor of the inner product of the vector a and input x + b divided by the radius.
      • I'm not sure about how the floor op is converted to bits of the actual hash -- ?
  • (Established): LSH-correlation, signed random projections h signh^{sign} :
    • Hash is the sign of the inner product of the input vector and a uniform random vector a.
    • This is a two-bit random projection [13][14].
  • (New) Asymmetric-LSH-L2:
    • P(x)=[x;||x|| 2 2;||x|| 2 4;....;||x|| 2 2 m]P(x) = [x;||x||^2_2; ||x||^4_2; .... ; ||x||^{2^m}_2] -- this is the pre-processing hashing of the 'keys'.
      • Requires that then norm of these keys, {||x||}_2 < U < 1$$
      • m3 m \geq 3
    • Q(x)=[x;1/2;1/2;...;1/2]Q(x) = [x;1/2; 1/2; ... ; 1/2] -- hashing of the queries.
    • See the mathematical explanation in the paper, but roughly "transformations P and Q, when normas are less than 1, provide correction to the L2 distance ||Q(p)P(x i)|| 2||Q(p) - P(x_i)||_2 , making in rank correlate with un-normalized inner product."
  • They then change the augmentation to:
    • P(x)=[x;1/2||x|| 2 2;1/2||x|| 2 4;...;1/2||x|| 2 2 m]P(x) = [x; 1/2 - ||x||^2_2; 1/2 - ||x||^4_2; ... ; 1/2 - ||x||^{2^m}_2]
    • Q(x)=[x;0;...;0]Q(x) = [x; 0; ...; 0]
    • This allows use of signed nearest-neighbor search to be used in the MIPS problem. (e.g. the hash is the sign of P and Q, per above; I assume this is still a 2-bit operation?)
  • Then the expand the U,M compromise function ρ\rho to allow for non-normalized queries. U depends on m and c (m is the codeword extension, and c is the ratio between o-target and off-target hash hits.
  • Tested on Movielens and Netflix databases, this using SVD preprocessing on the user-item matrix (full rank matrix indicating every user rating on every movie (mostly zeros!)) to get at the latent vectors.
  • In the above plots, recall (hah) that precision is the number of true positives / number of false positives as the number of draws k increases; recall is the number of true positives / number of draws k.
    • Clearly, the curve bends up and to the right when there are a lot of hash tables K.
    • Example datapoint: 50% precision at 40% recall, top 5. So on average you get 2 correct hits in 4 draws. Or: 40% precision, 20% recall, top 10: 2 hits in 5 draws. 20/40: 4 hits in 20 draws. (hit: correctly within the top-N)
    • So ... it's not that great.

Use case: Capsule: a camera based positioning system using learning
  • Uses 512 SIFT features as keys and queries to LSH. Hashing is computed via sparse addition / subtraction algorithm, with K bits per hash table (not quite random projections) and L hash tables. K = 22 and L = 24. ~ 1000 training images.
  • Best matching image is used as the location of the current image.

hide / / print
ref: -2016 tags: locality sensitive hash deep learning regularization date: 03-30-2020 02:07 gmt revision:5 [4] [3] [2] [1] [0] [head]

Scalable and sustainable deep learning via randomized hashing

  • Central idea: replace dropout, adaptive dropout, or winner-take-all with a fast (sublinear time) hash based selection of active nodes based on approximate MIPS (maximum inner product search) using asymmetric locality-sensitive hashing.
    • This avoids a lot of the expensive inner-product multiply-accumulate work & energy associated with nodes that will either be completely off due to the ReLU or other nonlinearity -- or just not important for the algorithm + current input.
    • The result shows that you don't need very many neurons active in a given layer for successful training.
  • C.f: adaptive dropout adaptively chooses the nodes based on their activations. A few nodes are sampled from the network probabalistically based on the node activations dependent on their current input.
    • Adaptive dropouts demonstrate better performance than vanilla dropout [44]
    • It is possible to drop significantly more nodes adaptively than without while retaining superior performance.
  • WTA is an extreme form of adaptive dropout that uses mini-batch statistics to enforce a sparsity constraint. [28] {1507} Winner take all autoencoders
  • Our approach uses the insight that selecting a very sparse set of hidden nodes with the highest activations can be reformulated as dynamic approximate query processing, solvable with LSH.
    • LSH can be sub-linear time; normal processing involves the inner product.
    • LSH maps similar vectors into the same bucket with high probability. That is, it maps vectors into integers (bucket number)
  • Similar approach: Hashed nets [6], which aimed to decrease the number of parameters in a network by using a universal random hash function to tie weights. Compressing neural networks with the Hashing trick
    • "HashedNets uses a low-cost hash function to randomly group connection weights into hash buckets, and all connections within the same hash bucket share a single parameter value."
  • Ref [38] shows how asymmetric hash functions allow LSH to be converted to a sub-linear time algorithm for maximum inner product search (MIPS).
  • Used multi-probe LSH: rather than having a large number of hash tables (L) which increases hash time and memory use, they probe close-by buckets in the hash tables. That is, they probe bucket at B_j(Q) and those for slightly perturbed query Q. See ref [26].
  • See reference [2] for theory...
  • Following ref [42], use K randomized hash functions to generate the K data bits per vector. Each bit is the sign of the asymmetric random projection. Buckets contain a pointer to the node (neuron); only active buckets are kept around.
    • The K hash functions serve to increase the precision of the fingerprint -- found nodes are more expected to be active.
    • Have L hash tables for each hidden layer; these are used to increase the probability of finding useful / active nodes due to the randomness of the hash function.
    • Hash is asymmetric in the sense that the query and collection data are hashed independently.
  • In every layer during SGD, compute K x L hashes of the input, probe about 10 L buckets, and take their union. Experiments: K = 6 and L = 5.
  • See ref [30] where authors show around 500x reduction in computations for image search following different algorithmic and systems choices. Capsule: a camera based positioning system using learning {1506}
  • Use relatively small test data sets -- MNIST 8M, NORB, Convex, Rectangles -- each resized to have small-ish input vectors.

  • Really want more analysis of what exactly is going on here -- what happens when you change the hashing function, for example? How much is the training dependent on suitable ROC or precision/recall on the activation?
    • For example, they could have calculated the actual real activation & WTA selection, and compared it to the results from the hash function; how correlated are they?

hide / / print
ref: -2002 tags: hashing frequent items count sketch algorithm google date: 03-30-2020 02:04 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

Finding frequent items in data streams

  • Notation:
    • S is a data stream, S=q 1,q 2,...,q n S = q_1, q_2, ..., q_n length n.
    • Each object q iO=o 1,...o mq_i \in O = {o_1, ... o_m} That is, there are m total possible objects (e.g. English words).
    • Object o i o_i occurs n in_i times in S. The o no_n are ordered so that n 1n 2n m n_1 \geq n_2 \geq n_m .
  • Task:
    • Given an input stream S, integer k, and real ε\epsilon
    • Output a list of k elements from S such that each element has n i>(1ε)n k n_i \gt (1-\epsilon)n_k .
      • That is, if the ordering is perfect, n in k n_i \geq n_k , with equality on the last element.
  • Algorithm:
    • h 1,...,h th_1, ..., h_t hashes from object q to buckets 1,...,b{1, ..., b}
    • s 1,...,s ts_1, ..., s_t hashes from object q to 1,+1{-1, +1}
    • For each symbol, add it to the 2D hash array by hashing first with h ih_i , then increment that counter with s is_i .
      • The double-hasihing is to reduce the effect of collisions with high-frequency items.
    • When querying for frequency of a object, hash like others, and take the median over i of h i[q]*s i[q] h_i[q] * s_i[q]
    • t=O(log(nδ))t = O(log(\frac{n}{\delta})) where the algorithm fails with at most probability δ\delta
  • Demonstrate proof of convergence / function with Zipfian distributions with varying exponent. (I did not read through this).
  • Also showed that it's possible to compare these hash-counts directly to see what's changed,or importantly if the documents are different.

Mission: Ultra large-scale feature selection using Count-Sketches
  • Task:
    • Given a labeled dataset (X i,y i)(X_i, y_i) for i1,2,...,ni \in {1,2, ..., n} and X i p,y iX_i \in \mathbb{R}^p, y_i \in \mathbb{R}
    • Find the k-sparse feature vector / linear regression for the mean squares problem min||B|| 0=k||yXΒ|| 2 \frac{min}{||B||_0=k} ||y-X\Beta||_2
      • ||B|| 0=k ||B||_0=k counts the non-zero elements in the feature vector.
    • THE number of features pp is so large that a dense Β\Beta cannot be stored in memory. (X is of course sparse).
  • Such data may be from ad click-throughs, or from genomic analyses ...
  • Use the count-sketch algorithm (above) for capturing & continually updating the features for gradient update.
    • That is, treat the stream of gradient updates, in the normal form g i=2λ(y iX iΒ iX t) tX ig_i = 2 \lambda (y_i - X_i \Beta_i X^t)^t X_i , as the semi-continuous time series used above as SS
  • Compare this with greedy thresholding, Iterative hard thresholding (IHT) e.g. throw away gradient information after each batch.
    • This discards small gradients which may be useful for the regression problem.
  • Works better, but not necessarily better than straight feature hashing (FH).
  • Meh.

hide / / print
ref: -2017 tags: locality sensitive hashing olfaction kenyon cells neuron sparse representation date: 01-18-2020 21:13 gmt revision:1 [0] [head]

PMID-29123069 A neural algorithm for a fundamental computing problem

  • Ceneral idea: locality-sensitive hashing, e.g. hashing that is sensitive to the high-dimensional locality of the input space, can be efficiently solved using a circuit inspired by the insect olfactory system.
  • Here, activation of 50 different types of ORNs is mapped to 50 projection neurons, which 'centers the mean' -- concentration dependence is removed.
  • This is then projected via a random matrix of sparse binary weights to a much larger set of Kenyon cells, which in turn are inhibited by one APL neuron.
  • Normal locality-sensitive hashing uses dense matrices of Gaussian-distributed random weights, which means higher computational complexity...
  • ... these projections are governed by the Johnson-Lindenstrauss lemma, which says that projection from high-d to low-d space can preserve locality (distance between points) within an error bound.
  • Show that the WTA selection of the top 5% plus random binary weight preserves locality as measured by overlap with exact input locality on toy data sets, including MNIST and SIFT.
  • Flashy title as much as anything else got this into Science... indeed, has only been cited 6 times in Pubmed.