Sketches Code Reference¶

Plain-Old Count-Min Sketch¶

count_min_sketch.py

This module implements the basic count-min sketch class, from which most others inherit. It also offers an implementation of a count-min sketch that keeps track of some number of most commonly recurring items.

class minsketch.count_min_sketch.CountMinSketch(delta, epsilon, depth=None, width=None, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, hash_strategy=<class 'minsketch.hash_strategy.NaiveHashingStrategy'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶

A minimalist implementation of count-min sketches

get(item)[source]¶

Retrieve the current minimum value associated with an item

Parameters:	item – The item to retrieve
Returns:	The best current estimate for how often it appeared so far

inner_product_query(first_item, second_item)[source]¶

Not expanded upon, but a simple implementation of the idea of an inner-product query using a CM-sketch, as dicussed in the original paper

Parameters:	first_item – The first item queried second_item – The second item queried
Returns:	The best estimate for their inner-product within the CM-sketch

insert(item, count=1)[source]¶

Insert an item with a given count into the CM-sketch, using the provided hashing, lossy counting, and updating strategies

Parameters:	item – The item to insert. Must be a number or sensibly hashable

(say, a string) :param count: The count to insert with, defaults to 1 :return: The new minimal value of this item in the CM-sketch.

update(items, counts=None)[source]¶

Implement a group update of items and counts

Parameters:	items – The items to update in counts – The count of each item. If exists, assumed to be the

same size as items. :return: None

class minsketch.count_min_sketch.TopNCountMinSketch(delta, epsilon, depth=None, width=None, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, hash_strategy=<class 'minsketch.hash_strategy.NaiveHashingStrategy'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶

Implementation of a CountMinSketch supporting tracking the top-N items.

Inspired in part by the following blog post: https://tech.shareaholic.com/2012/12/03/the-count-min-sketch-how-to-count-over-large-keyspaces-when-about-right-is-good-enough/

insert(item, count=1)[source]¶

Insert the item, and update the top-N registry.

Parameters:	item – The item to insert. Must be a number or sensibly hashable

(say, a string) :param count: The count to insert with, defaults to 1 :return: The new minimal count for this item

most_common(k=None)[source]¶: Return the most common k items, for k <= n. Accepting a parameter to conform to python’s Counter interface, and returning the results in their style :param k: The number of top items to return, at most n, as this function was initialized :return: The top k items observed by this CM sketch, and how often they were seen

Double-Hashing Count-Min Sketch¶

double_hashing.py

Two implementations of ideas from “Less Hashing, Same Performance: Building a Better Bloom Filter,” by Kirsch and Mitzenmacher (2008): https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

class minsketch.double_hashing.HashPairCMSketch(delta, epsilon, hash_gen=None, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶: A single hash-pair count-min sketch, based on Kirsch & Mitzenmacher (2008). This essentially ignores the epsilon parameter, and provides a robust implementation of a count-min sketch using only two hash functions.

class minsketch.double_hashing.MultiHashPairTopNCMSketch(delta, epsilon, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶

From the previously referenced paper: “Given such a result, it is straightforward to obtain a variation that uses \(2\frac{\ln \frac{1}{\delta}}{\ln \frac{1}{\epsilon}}\) pairwise independent hash functions and achieves the desired failure probability \(\delta\): simply build \(2\frac{\ln \frac{1}{\delta}}{\ln \frac{1}{\epsilon}}\) independent copies of this data structure, and always answer a point query with the minimum estimate given by one of those copies.”

I believe the second 2 is a mistake - so I only build \(\frac{\ln \frac{1}{\delta}}{\ln \frac{1}{\epsilon}}\) rounded up sketches.

get(item)[source]¶

Retrieve the current minimum value associated with an item

Parameters:	item – The item to retrieve
Returns:	The best current estimate for how often it appeared so far

insert(item, count=1)[source]¶

Insert the item, and update the top-N registry.

Parameters:	item – The item to insert. Must be a number or sensibly hashable

(say, a string) :param count: The count to insert with, defaults to 1 :return: The new minimal count for this item

Counter-Sketch Hybrid¶

counter_sketch_hybrid.py

This module implements a best-of-both-worlds approach - as python Counter objects are much faster than all count-min sketches implemented, but take indefinite amounts of space, we create a hybrid - saving new data into a Counter as it arrives, and batch-updating the count-min sketch.

class minsketch.counter_sketch_hybrid.SketchCounterHybrid(sketch, batch_size=10000)[source]¶

The hybrid object, as described in the module documentation.

On initial benchmarking, using the default batch size, it provides a performance boost of about 10x over using the sketch alone (on ~1M words, from 88 seconds to 8), at the cost about 2.5x memory (208kb to 500kb).

Count-Min-Mean Sketch¶

count_mean_sketch.py

This model implements count-mean-min sketches, as described by Goyal and Daumé (2011): http://www.aclweb.org/anthology/D12-1100

The design here is a little iffy - there’s a fair amount of code duplication between the two classes. Perhaps a mixin would fit better, or changing the abstractions.

class minsketch.count_mean_sketch.CountMeanMinSketch(delta, epsilon, depth=None, width=None, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, hash_strategy=<class 'minsketch.hash_strategy.NaiveHashingStrategy'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶

A count-mean-min sketch, based on the regular TopNCountMinSketch.

The two main changes include using error-estimates from the updater (regular or conservative), and returning min between the minimum of the regularly queried values, and the median of the values after accounting for the error estimates.

get(item)[source]¶

Retrieve the current minimum value associated with an item

Parameters:	item – The item to retrieve
Returns:	The best current estimate for how often it appeared so far

most_common(k=None)[source]¶: Re-query the top-n keys using the count-mean query algorithm, and then return the top k of the re-queried results :param k: The number of top items to return, at most n, as this function was initialized :return: The top k items observed by this CM sketch, and how often they were seen

class minsketch.count_mean_sketch.HashPairCountMeanMinSketch(delta, epsilon, hash_gen=None, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶

A count-mean-min sketch, based on the HashPair CM SKETCH.

The two main changes include using error-estimates from the updater (regular or conservative), and returning min between the minimum of the regularly queried values, and the median of the values after accounting for the error estimates.

get(item)[source]¶

Retrieve the current minimum value associated with an item

Parameters:	item – The item to retrieve
Returns:	The best current estimate for how often it appeared so far

most_common(k=None)[source]¶: Re-query the top-n keys using the count-mean query algorithm, and then return the top k of the re-queried results :param k: The number of top items to return, at most n, as this function was initialized :return: The top k items observed by this CM sketch, and how often they were seen

Least-Squares Estimating Sketch¶

least_squares_sketch.py

A least-squares estimator for count-min sketches, as introduced by Lee, Lui, Yoon, & Zhang: https://www.usenix.org/legacy/event/imc05/tech/full_papers/lee/lee.pdf

class minsketch.least_squares_sketch.LeastSquaresTopNSketch(delta, epsilon, depth=None, width=None, n=100, table_class=<class 'minsketch.sketch_tables.ListBackedSketchTable'>, hash_strategy=<class 'minsketch.hash_strategy.NaiveHashingStrategy'>, update_strategy=<class 'minsketch.update_strategy.NaiveUpdateStrategy'>, lossy_strategy=<class 'minsketch.lossy_strategy.NoLossyUpdateStrategy'>)[source]¶

The least-squares top-N sketch only overrides the most common item finding heuristic:

most_common(k=None)[source]¶

Generate a better estimate for the top k most-common entries using the least squares estimator. We always solve the solution for the entire top n, and return the first k, to allow for more accuracy if only a subset is wanted (as the rest end up being used as noise variables).

Formally, we construct \(\vec{b}\) as a concatenation of the tables rows, implemented by table.to_vector(). The size of \(\vec{b}\) is thus (table.depth * table.width) x 1.

We construct \(A\) as having the same number or rows as \(\vec{b}\), and having \(k + 1\) columns, using the following values:

for \(i \in \{0, 1, ..., \text{table.width} - 1\}\) and \(j \in \{0, 1, ..., \text{table.depth} - 1 \}\),

we set \(A_{\text{table.depth} * i + j + 1, l} = 1\) if either the item \(x_l\) is hashed into \(T[i][j]\), or if \(l = \text{table.width} + 1\), and otherwise we leave it as 0.

We then solve using least-squares estimation (using the pseudoinverse) \(A\vec{x}=\vec{b}\), sort the results in a descending fashion (discarding the noise), and report the top k entries.

Parameters:	k – The number of top results to return - <= to n
Returns:	The least-squares estimate for how often these top k appeared