Dobrý odhad vydá za tisíc přesných počtů
16. 11. 2017
Původně jsem chtěl napsat něco obecného o pravděpodobnostních algoritmech a užitečnosti rychlých odhadů, ale nemám na to sílu. Místo toho jsem vybral pár článků a paperů, které se týkají tématu, roztřídil je na obecné a ty o Bloom filtrech, HyperLogLogách, Count-min skečích a další. Tady to máte:
obecné
- Probabilistic Data Structures for Web Analytics and Data Mining
- Algebra for Analytics
- ApproxHadoop: Bringing Approximations to MapReduce Frameworks
- scala.collection.approximate
- Data Sketches - Fast, Approximate Analysis of Big Data
- Hash Collision Probabilities
- Probabilistic techniques, data streams and online learning - Looking forward to a bigger 2015
- Stream Processing and Probabilistic Methods: Data at Scale
- The Design of Approximation Algorithms
- References for Data Stream Algorithms
Bloom
- Bloomin' Marvellous
- A Garden Variety of Bloom Filters
- Adaptive Cuckoo Filters
- Bloofi: Multidimensional Bloom Filters
- Bloom filter
- Scalable Bloom Filters
- dablooms - an open source, scalable, counting bloom filter library
- Why Bloom filters work the way they do
- The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables
- Cuckoo Filter: Practically Better than Bloom
- Cuckoo Linear Algebra
- Cuckoo filters and their analysis
- Cuckoo Filters
- Algorithmic improvements for fast concurrent cuckoo hashing
- Don’t Thrash: How to Cache Your Hash on Flash
- Flower Filter – A simple time-decaying approximate membership filter
- Golomb-coded sets: smaller than Bloom filters
- The Opposite of a Bloom Filter
- Invertible Bloom Lookup Tables
HyperLogLog
- HLL Intersections
- Hacking Cassandra
- HyperLogLog and MinHash - A Union for Intersections
- HyperLogLog and MinHash
- HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
- HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm
- HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
- HyperLogSandwich
- Redis new data structure: the HyperLogLog
- Min Value: Jak odhadnout počet unikátních hodnot
- Linear Counting: Jak odhadnout počet unikátních hodnot
- LogLog: Jak odhadnout počet unikátních hodnot
- HyperLogLog: Jak odhadnout počet unikátních hodnot
- Sjednocení Hyperloglogu
- Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
- It's Log! It's Log! It's Big, It's Hyper, It's Good!
- Finding Frequent Items in Data Streams
Count-min sketch
- Count-Min Sketch - A Data Structure for Stream Mining Applications
- AK Data Science Summit - S. Muthu Muthukrishnan - Count-Min Sketch: 10 Years Later
- A generic version of CountMinSketch[T] to decouple CMS from Long?
Locality sensitive hashing
- A Brief Index for Proximity Searching
- Comparative Analysis of Data Structures for Approximate Nearest Neighbor Search
- Effective Proximity Retrieval by Ordering Permutations
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
- Engineering Efficient and Effective Non-Metric Space Library
- Fast Computation of min-Hash Signatures for Image Collections
- Hashing, sketching, and other approximate algorithms for high-dimensional data
- K-medoids LSH: a new locality sensitive hashing in general metric space
- Kernelized Locality-Sensitive Hashing Page
- LSH forest: self-tuning indexes for similarity search.
- Large-Scale Distributed Locality-Sensitive Hashing for General Metric Data
- Large-scale similarity data management with distributed Metric Index
- Learning to Prune in Metric and Non-Metric Spaces
- Locality-Sensitive Hashing Scheme Based on p-Stable Distributions
- Metric Space Searching Based on Random Bisectors and Binary Fingerprints
- MinHashing and LSH - dimensionality reduction for finding similar items among massive datasets
- MinHashing and LSH - dimensionality reduction for finding similar items among massive datasets, Part 2
- Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search
- Nearest neighbors and vector models – part 2 – algorithms and data structures
- Off the Beaten Path: Let’s Replace Term-Based Retrieval with k-NN Search
- On Locality Sensitive Hashing in metric spaces
- On Locality-sensitive indexing in generic metric spaces
- Optimal Data-Dependent Hashing for Nearest Neighbor Search
- Permutation Search Methods are Efficient, Yet Faster Search is Possible
- Practical and Optimal LSH for Angular Distance
- Robust and Efficient Locality Sensitive Hashing for Nearest Neighbor Search in Large Data Sets
- Spectral Hashing
- Speeding Up Permutation Based Indexing with Indexing
- Stable Distributions, Pseudorandom Generators, Embeddings, and Data Stream Computation
- Succinct Nearest Neighbor Search
ostatní
- Sketch of the Day: Frugal Streaming
- q-digest - Medians and Beyond: New Aggregation Techniques for Sensor Networks
- q-digest : an algorithm for computing approximate quantiles on a collection of integers
- The Art of Approximating Distributions: Histograms and Quantiles at Scale
- Doing-the-impossible
- Moment-based quantile sketches for efficient high cardinality aggregation queries