Flajolet-Martin Estimator
Paper: "Probabilistic counting algorithms for data base applications"
Innovation: First probabilistic cardinality estimation using bit patterns
Key Insight: Use hash function rank (trailing zeros) to estimate cardinality
LogLog Algorithm
Authors: Durand & Flajolet
Innovation: Introduced bucketing to reduce variance
Memory: log(log(N)) bits for N distinct elements
Improvement: Better accuracy through averaging multiple estimates
SuperLogLog
Enhancement: Added outlier elimination
Method: Remove ~30% of largest bucket values
Benefit: Reduced impact of hash collisions and outliers
HyperLogLog
Authors: Flajolet et al.
Innovation: Harmonic mean instead of arithmetic mean
Accuracy: Near-optimal cardinality estimation
Industry Standard: Widely adopted in big data systems
Philippe Flajolet (1948-2011)
Pioneer of analytic combinatorics and probabilistic algorithms. His work laid the foundation for modern cardinality estimation techniques used in big data analytics worldwide.
Flajolet-Martin Estimator (1985)
The foundational algorithm that started it all. Uses a single bitmap to track the maximum rank observed.
Limitations:
- High variance in estimates
- Single point of failure
- Limited accuracy for practical applications
LogLog Estimator (2003)
Introduced bucketing to reduce variance by averaging multiple independent estimates.
Key Innovation: Bucketing reduces variance by factor of √m
SuperLogLog Estimator (2007)
Enhanced LogLog with outlier elimination to improve robustness.
Improvement: More robust against hash collisions and extreme values
HyperLogLog Estimator (2007)
The breakthrough algorithm using harmonic mean for near-optimal accuracy.
Revolutionary Features:
- Harmonic mean reduces bias significantly
- Sophisticated bias correction formulas
- Small and large range corrections
- Near-optimal accuracy with minimal memory
Why "LogLog"?
The name comes from the memory requirement: to count cardinalities up to Nmax, these algorithms need approximately log(log(Nmax)) bits. This means cardinalities in the billions can be estimated using just 1-2 kilobytes of memory!
Performance Comparison
| Algorithm | Year | Memory | Accuracy | Key Innovation |
|---|---|---|---|---|
| Flajolet-Martin | 1985 | 32 bits | High variance | First probabilistic counting |
| LogLog | 2003 | log(log(N)) | Good | Bucketing for variance reduction |
| SuperLogLog | 2007 | log(log(N)) | Better | Outlier elimination |
| HyperLogLog | 2007 | log(log(N)) | Near-optimal | Harmonic mean + range corrections |
Real-World Impact
HyperLogLog has become the industry standard for cardinality estimation:
- Google: BigQuery, Analytics
- Facebook: Audience insights, ad targeting
- Amazon: Redshift, DynamoDB
- Apache: Spark, Druid, Pinot
- Redis: PFADD, PFCOUNT commands
Why HyperLogLog Won
- Accuracy: Standard error of ~1.04/√m
- Memory: Fixed size regardless of cardinality
- Mergeable: Sketches can be combined
- Parallelizable: Easy to distribute computation