 m8ta
{1509} ref: -2002 tags: hashing frequent items count sketch algorithm google date: 03-30-2020

Notation: S is a data stream, $S = q_1, q_2, ..., q_n$, of length n. Each object $q_i \in O = \{o_1, ..., o_m\}$; that is, there are m total possible objects (e.g. English words). Object $o_i$ occurs $n_i$ times in S, and the $o_i$ are ordered so that $n_1 \geq n_2 \geq ... \geq n_m$.

Task: given an input stream S, an integer k, and a real $\epsilon$, output a list of k elements from S such that each element has $n_i \gt (1-\epsilon)n_k$. That is, if the ordering is perfect, $n_i \geq n_k$, with equality on the last element.

Algorithm:
$h_1, ..., h_t$ hash from an object q to buckets $\{1, ..., b\}$.
$s_1, ..., s_t$ hash from an object q to $\{-1, +1\}$.
For each symbol, add it to the 2D array of counters: hash with $h_i$ to select the bucket in row i, then increment that counter by $s_i(q)$. The double hashing reduces the effect of collisions with high-frequency items.
When querying the frequency of an object q, hash the same way and take the median over i of $s_i(q) \cdot c_i[h_i(q)]$, where $c_i$ is row i of the counter array.
$t = O(log(\frac{n}{\delta}))$, where the algorithm fails with probability at most $\delta$.

They prove convergence / correctness for Zipfian distributions with varying exponent (I did not read through this). They also show that the hash-counts of two streams can be compared directly to see what's changed, or importantly whether the documents are different.

MISSION: Ultra large-scale feature selection using Count-Sketches

Task: given a labeled dataset $(X_i, y_i)$ for $i \in \{1,2, ..., n\}$, with $X_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, find the k-sparse feature vector / linear regression for the least-squares problem $\min_{||\beta||_0=k} ||y-X\beta||_2$, where $||\beta||_0$ counts the non-zero elements of the feature vector. The number of features $p$ is so large that a dense $\beta$ cannot be stored in memory (X is of course sparse). Such data may come from ad click-throughs or from genomic analyses.
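The count-sketch structure above fits in a few lines of Python. A minimal sketch; the class name and the blake2b-based construction of the bucket and sign hashes are my own illustration, not from the paper:

```python
import hashlib
import statistics

class CountSketch:
    """Count-sketch frequency estimator: t rows of b counters,
    with a bucket hash h_i and a sign hash s_i per row."""

    def __init__(self, t, b):
        self.t, self.b = t, b
        self.rows = [[0] * b for _ in range(t)]

    def _hash(self, q, i):
        # Derive bucket h_i(q) and sign s_i(q) from one keyed digest.
        d = hashlib.blake2b(f"{i}:{q}".encode(), digest_size=8).digest()
        bucket = int.from_bytes(d[:4], "little") % self.b
        sign = 1 if d[4] & 1 else -1
        return bucket, sign

    def add(self, q, count=1):
        for i in range(self.t):
            bucket, sign = self._hash(q, i)
            self.rows[i][bucket] += sign * count

    def estimate(self, q):
        # Median over rows of s_i(q) * c_i[h_i(q)]; the random signs
        # make collision noise zero-mean, the median rejects outliers.
        return statistics.median(
            sign * self.rows[i][bucket]
            for i in range(self.t)
            for bucket, sign in [self._hash(q, i)]
        )
```

The sign hash is the key design choice: a colliding high-frequency item pushes the shared counter up in some rows and down in others, so the median over rows stays close to the true count.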
Use the count-sketch algorithm (above) to capture & continually update the features for the gradient update. That is, treat the stream of stochastic gradient updates, in the usual least-squares form $g_i = 2 \lambda (y_i - X_i \beta) X_i^T$, as the data stream $S$ above. Compare this with greedy thresholding and iterative hard thresholding (IHT), which throw away gradient information after each batch; this discards small gradients which may be useful for the regression problem. The count-sketch version works better, but not necessarily better than straight feature hashing (FH). Meh.
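To make the idea concrete, a toy self-contained sketch of MISSION-style k-sparse SGD: gradients are accumulated in a count-sketch instead of a dense $\beta$, and a hard-thresholded top-k feature set is re-extracted by querying the sketch. The function name, hash construction, hyperparameters, and the two-true-feature synthetic data are all illustrative assumptions, not the paper's setup:

```python
import hashlib
import heapq
import random
import statistics

def _hash(j, i, b):
    # Illustrative bucket/sign hashes for feature index j, row i.
    d = hashlib.blake2b(f"{i}:{j}".encode(), digest_size=8).digest()
    return int.from_bytes(d[:4], "little") % b, (1 if d[4] & 1 else -1)

def mission_sgd(samples, k, t=5, b=256, lr=0.1, epochs=20):
    """MISSION-style k-sparse regression: SGD gradient updates go into
    a count-sketch; only the top-k features are kept as a dense beta."""
    rows = [[0.0] * b for _ in range(t)]

    def add(j, g):
        for i in range(t):
            bucket, sign = _hash(j, i, b)
            rows[i][bucket] += sign * g

    def query(j):
        return statistics.median(sign * rows[i][bucket]
                                 for i in range(t)
                                 for bucket, sign in [_hash(j, i, b)])

    beta = {}  # current top-k features -> estimated coefficients
    for _ in range(epochs):
        for x, y in samples:  # x is sparse: {feature index: value}
            err = y - sum(v * beta.get(j, 0.0) for j, v in x.items())
            for j, v in x.items():
                add(j, 2 * lr * err * v)  # gradient step into the sketch
            # hard-threshold: keep the k largest-magnitude estimates
            cand = {j: query(j) for j in set(beta) | set(x)}
            beta = dict(heapq.nlargest(k, cand.items(),
                                       key=lambda kv: abs(kv[1])))
    return beta

# Hypothetical synthetic demo: y depends only on features 5 and 42,
# each sample also carries one random noise feature.
random.seed(0)
samples = []
for _ in range(200):
    x = {5: random.random(), 42: random.random(),
         random.randrange(100, 1000): random.random()}
    samples.append((x, 3 * x[5] - 2 * x[42]))
selected = mission_sgd(samples, k=2)
```

Because small gradients keep accumulating in the sketch rather than being discarded at each hard-threshold step, a feature that is individually weak but consistently useful can still climb into the top-k, which is the claimed advantage over plain IHT.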