Sketches Code Reference¶
Plain-Old Count-Min Sketch¶
count_min_sketch.py
This module implements the basic count-min sketch class, from which most others inherit. It also offers an implementation of a count-min sketch that keeps track of some number of most commonly recurring items.
-
class
minsketch.count_min_sketch.
CountMinSketch
(delta, epsilon, depth=None, width=None, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, hash_strategy=<class 'minsketch.hash_strategy.NaiveHashingStrategy'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶ A minimalist implementation of count-min sketches
-
get
(item)[source]¶ Retrieve the current minimum value associated with an item
Parameters: item – The item to retrieve Returns: The best current estimate for how often it appeared so far
-
inner_product_query
(first_item, second_item)[source]¶ Not expanded upon, but a simple implementation of the idea of an inner-product query using a CM-sketch, as dicussed in the original paper
Parameters: - first_item – The first item queried
- second_item – The second item queried
Returns: The best estimate for their inner-product within the CM-sketch
-
insert
(item, count=1)[source]¶ Insert an item with a given count into the CM-sketch, using the provided hashing, lossy counting, and updating strategies
Parameters: item – The item to insert. Must be a number or sensibly hashable (say, a string) :param count: The count to insert with, defaults to 1 :return: The new minimal value of this item in the CM-sketch.
-
-
class
minsketch.count_min_sketch.
TopNCountMinSketch
(delta, epsilon, depth=None, width=None, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, hash_strategy=<class 'minsketch.hash_strategy.NaiveHashingStrategy'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶ Implementation of a CountMinSketch supporting tracking the top-N items.
Inspired in part by the following blog post: https://tech.shareaholic.com/2012/12/03/the-count-min-sketch-how-to-count-over-large-keyspaces-when-about-right-is-good-enough/
-
insert
(item, count=1)[source]¶ Insert the item, and update the top-N registry.
Parameters: item – The item to insert. Must be a number or sensibly hashable (say, a string) :param count: The count to insert with, defaults to 1 :return: The new minimal count for this item
-
most_common
(k=None)[source]¶ Return the most common k items, for k <= n. Accepting a parameter to conform to python’s Counter interface, and returning the results in their style :param k: The number of top items to return, at most n, as this function was initialized :return: The top k items observed by this CM sketch, and how often they were seen
-
Double-Hashing Count-Min Sketch¶
double_hashing.py
Two implementations of ideas from “Less Hashing, Same Performance: Building a Better Bloom Filter,” by Kirsch and Mitzenmacher (2008): https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
-
class
minsketch.double_hashing.
HashPairCMSketch
(delta, epsilon, hash_gen=None, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶ A single hash-pair count-min sketch, based on Kirsch & Mitzenmacher (2008). This essentially ignores the epsilon parameter, and provides a robust implementation of a count-min sketch using only two hash functions.
-
class
minsketch.double_hashing.
MultiHashPairTopNCMSketch
(delta, epsilon, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶ From the previously referenced paper: “Given such a result, it is straightforward to obtain a variation that uses \(2\frac{\ln \frac{1}{\delta}}{\ln \frac{1}{\epsilon}}\) pairwise independent hash functions and achieves the desired failure probability \(\delta\): simply build \(2\frac{\ln \frac{1}{\delta}}{\ln \frac{1}{\epsilon}}\) independent copies of this data structure, and always answer a point query with the minimum estimate given by one of those copies.”
I believe the second 2 is a mistake - so I only build \(\frac{\ln \frac{1}{\delta}}{\ln \frac{1}{\epsilon}}\) rounded up sketches.
Counter-Sketch Hybrid¶
counter_sketch_hybrid.py
This module implements a best-of-both-worlds approach - as python Counter objects are much faster than all count-min sketches implemented, but take indefinite amounts of space, we create a hybrid - saving new data into a Counter as it arrives, and batch-updating the count-min sketch.
-
class
minsketch.counter_sketch_hybrid.
SketchCounterHybrid
(sketch, batch_size=10000)[source]¶ The hybrid object, as described in the module documentation.
On initial benchmarking, using the default batch size, it provides a performance boost of about 10x over using the sketch alone (on ~1M words, from 88 seconds to 8), at the cost about 2.5x memory (208kb to 500kb).
Count-Min-Mean Sketch¶
count_mean_sketch.py
This model implements count-mean-min sketches, as described by Goyal and Daumé (2011): http://www.aclweb.org/anthology/D12-1100
The design here is a little iffy - there’s a fair amount of code duplication between the two classes. Perhaps a mixin would fit better, or changing the abstractions.
-
class
minsketch.count_mean_sketch.
CountMeanMinSketch
(delta, epsilon, depth=None, width=None, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, hash_strategy=<class 'minsketch.hash_strategy.NaiveHashingStrategy'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶ A count-mean-min sketch, based on the regular TopNCountMinSketch.
The two main changes include using error-estimates from the updater (regular or conservative), and returning min between the minimum of the regularly queried values, and the median of the values after accounting for the error estimates.
-
get
(item)[source]¶ Retrieve the current minimum value associated with an item
Parameters: item – The item to retrieve Returns: The best current estimate for how often it appeared so far
-
most_common
(k=None)[source]¶ Re-query the top-n keys using the count-mean query algorithm, and then return the top k of the re-queried results :param k: The number of top items to return, at most n, as this function was initialized :return: The top k items observed by this CM sketch, and how often they were seen
-
-
class
minsketch.count_mean_sketch.
HashPairCountMeanMinSketch
(delta, epsilon, hash_gen=None, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶ A count-mean-min sketch, based on the HashPair CM SKETCH.
The two main changes include using error-estimates from the updater (regular or conservative), and returning min between the minimum of the regularly queried values, and the median of the values after accounting for the error estimates.
-
get
(item)[source]¶ Retrieve the current minimum value associated with an item
Parameters: item – The item to retrieve Returns: The best current estimate for how often it appeared so far
-
most_common
(k=None)[source]¶ Re-query the top-n keys using the count-mean query algorithm, and then return the top k of the re-queried results :param k: The number of top items to return, at most n, as this function was initialized :return: The top k items observed by this CM sketch, and how often they were seen
-
Least-Squares Estimating Sketch¶
least_squares_sketch.py
A least-squares estimator for count-min sketches, as introduced by Lee, Lui, Yoon, & Zhang: https://www.usenix.org/legacy/event/imc05/tech/full_papers/lee/lee.pdf
-
class
minsketch.least_squares_sketch.
LeastSquaresTopNSketch
(delta, epsilon, depth=None, width=None, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, hash_strategy=<class 'minsketch.hash_strategy.NaiveHashingStrategy'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶ The least-squares top-N sketch only overrides the most common item finding heuristic:
-
most_common
(k=None)[source]¶ Generate a better estimate for the top k most-common entries using the least squares estimator. We always solve the solution for the entire top n, and return the first k, to allow for more accuracy if only a subset is wanted (as the rest end up being used as noise variables).
Formally, we construct \(\vec{b}\) as a concatenation of the tables rows, implemented by table.to_vector(). The size of \(\vec{b}\) is thus (table.depth * table.width) x 1.
We construct \(A\) as having the same number or rows as \(\vec{b}\), and having \(k + 1\) columns, using the following values:
for \(i \in \{0, 1, ..., \text{table.width} - 1\}\) and \(j \in \{0, 1, ..., \text{table.depth} - 1 \}\),
we set \(A_{\text{table.depth} * i + j + 1, l} = 1\) if either the item \(x_l\) is hashed into \(T[i][j]\), or if \(l = \text{table.width} + 1\), and otherwise we leave it as 0.
We then solve using least-squares estimation (using the pseudoinverse) \(A\vec{x}=\vec{b}\), sort the results in a descending fashion (discarding the noise), and report the top k entries.
Parameters: k – The number of top results to return - <= to n Returns: The least-squares estimate for how often these top k appeared
-