minsketch

A flexible implementation of several min-sketch variants.

A count-min sketch is a probabilistic counting data structure - to save space, we allow for some known probability of counting error. This allows to summarize a data stream with limited space, without growing indefinitely as we see more and more data.

For a longer description, and some of the relevant papers, see: https://sites.google.com/site/countminsketch/

Features

  • Supports tables backed by Python lists, arrays, NumPy matrices, and bitarrays
  • Supports both standard hash-per-row and a double-hashing scheme
  • Supports both regular updating and conservative updating
  • Supports a Counter - sketch hybrid for a more balanced time-memory trade-off
  • Supports a number of different lossy-counting strategies
  • Supports using a linear equation solving estimation strategy

Example

import minsketch
from functools import partial
from timeit import timeit

# Delta and epsilon and uncertainty measures
# Count-min sketches are within epsilon of the true value
# With probability 1 - delta
delta = 10 ** -5
epsilon = 0.001

# This implementation supports several different backing table classes
# For most purposes, the Python array makes for a good default:
table_class = minsketch.sketch_tables.ArrayBackedSketchTable

# If you are willing to only support positive updates (no deletions)
# You should use conservative updates, as significantly reduces error:
update_strategy = minsketch.update_strategy.ConservativeUpdateStrategy

# If you have a reason to believe you might want lossy updating, you
# should benchmark one of the lossy updating schemes on your data:
lossy_strategy = minsketch.lossy_strategy.NoLossyUpdateStrategy

# So far, the hash-pair based implementation appears to outperform
# the universal hash family based one:
sketch = minsketch.double_hashing.HashPairCMSketch(
    delta, epsilon, table_class=table_class,
    update_strategy=update_strategy, lossy_strategy=lossy_strategy)

# Update the sketch with some data
from numpy import random
data = random.randint(0, 1000, 100000)
print(timeit(partial(sketch.update, data), number=1))

# Query the ten most common elements:
print(sketch.most_common(10))

# For a performance boost, you can also use a Counter-Sketch Hybrid
hybrid = minsketch.counter_sketch_hybrid.SketchCounterHybrid(
    minsketch.double_hashing.HashPairCMSketch(
    delta, epsilon, table_class=table_class,
    update_strategy=update_strategy, lossy_strategy=lossy_strategy))

print(timeit(partial(hybrid.update, data), number=1))
print(hybrid.most_common(10))

TODO

  • Add a hybrid Counter into Count-Min sketch object to enjoy the best of both worlds
  • Add to PyPi.
  • Add some benchmark results to the documentation
  • Consider adding some benchmark graphs?

Installation

Pull from github, run setup.py, rejoice. Will be added to PyPi soon.

Support

If you are having issues, please let me know: prefix@suffix where prefix = guy and suffix = minerva.kgi.edu

License

The project is licensed under the MIT license.