Algorithm Evolution

From Flajolet-Martin to HyperLogLog: 22 Years of Innovation

CardinalityKit Documentation
1985

Flajolet-Martin Estimator

Paper: "Probabilistic counting algorithms for data base applications"

Innovation: First probabilistic cardinality estimation using bit patterns

Key Insight: Use hash function rank (trailing zeros) to estimate cardinality

2003

LogLog Algorithm

Authors: Durand & Flajolet

Innovation: Introduced bucketing to reduce variance

Memory: O(log log N) bits per bucket for cardinalities up to N

Improvement: Better accuracy through averaging multiple estimates

2007

SuperLogLog

Enhancement: Added outlier elimination

Method: Discard the largest ~30% of bucket values before averaging

Benefit: Reduced impact of hash collisions and outliers

2003

HyperLogLog

Authors: Flajolet et al.

Innovation: Harmonic mean instead of arithmetic mean

Accuracy: Near-optimal cardinality estimation

Industry Standard: Widely adopted in big data systems

Philippe Flajolet (1948-2011)

Pioneer of analytic combinatorics and probabilistic algorithms. His work laid the foundation for modern cardinality estimation techniques used in big data analytics worldwide.

Flajolet-Martin Estimator (1985)

The foundational algorithm that started it all. Uses a single bitmap to record which hash ranks have been observed; the position of the first unset bit yields the estimate.

def FlajoletMartin(hashes):
    """Estimate cardinality from an iterable of fixed-width bit strings."""
    # Initialize 32-bit bitmap
    bitmap = [0] * 32
    for hash_val in hashes:
        # Rank = number of trailing zero bits in the hash
        rank = 0
        for c in reversed(hash_val):
            if c == "0":
                rank += 1
            else:
                break
        # Mark this rank as observed
        bitmap[rank] = 1
    # r = position of the first 0 in the bitmap
    r = 0
    for bit in bitmap:
        if bit == 1:
            r += 1
        else:
            break
    # Bias-corrected estimate: 2**r / phi
    PHI = 0.77351
    return 2 ** r / PHI

Limitations:

  • High variance in estimates
  • Relies on a single estimator, so one unlucky hash value skews the result
  • Limited accuracy for practical applications
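
The variance is easy to see empirically. The sketch below is a toy simulation (using Python's random module, and the simplified "keep only the maximum rank" variant rather than the full bitmap above) that estimates the same cardinality five times with fresh random hashes:

```python
import random

def trailing_zeros(x, width=32):
    # Rank of a hash value: number of trailing zero bits (width if x == 0)
    r = 0
    while r < width and (x >> r) & 1 == 0:
        r += 1
    return r

random.seed(42)
PHI = 0.77351          # Flajolet-Martin bias-correction constant
true_n = 10_000        # actual number of distinct elements
estimates = []
for _ in range(5):
    # One sketch: remember only the largest rank seen among true_n hashes
    best = max(trailing_zeros(random.getrandbits(32)) for _ in range(true_n))
    estimates.append(2 ** best / PHI)
```

Because each run can only produce values of the form 2**r / PHI, consecutive runs typically disagree by a factor of two or more — exactly the variance that bucketing later addresses.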

LogLog Estimator (2003)

Introduced bucketing to reduce variance by averaging multiple independent estimates.

def LogLog(hashes, k):
    m = 2 ** k              # number of buckets
    buckets = [0] * m
    for hash_val in hashes:  # hash_val is a fixed-width bit string
        # Bucket index = first k bits
        j = int(hash_val[:k], 2)
        # Rank = 1-based position of the rightmost 1 in the remaining bits
        rank = 1
        for c in reversed(hash_val[k:]):
            if c == "0":
                rank += 1
            else:
                break
        # Keep the maximum rank per bucket
        buckets[j] = max(buckets[j], rank)
    # Estimate from the arithmetic mean of the bucket ranks
    average = sum(buckets) / m
    ALPHA = 0.397011808     # asymptotic bias-correction constant
    return ALPHA * m * 2 ** average

Key Innovation: Averaging m independent bucket estimates cuts the standard error by a factor of √m
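
Concretely, the bucket split works like this (a toy 16-bit example rather than the 32-bit hashes assumed elsewhere, using integer bit operations instead of the string slicing in the code above):

```python
k = 4                       # first k bits select one of m = 2**k buckets
width = 16                  # total hash width for this toy example
h = 0b1011_0001_0000_0000   # a hypothetical 16-bit hash value

j = h >> (width - k)                  # bucket index from the leading 4 bits
rest = h & ((1 << (width - k)) - 1)   # the remaining 12 bits

# Rank = 1-based position of the rightmost 1-bit in the remainder
rank = 1
while rank <= width - k and (rest >> (rank - 1)) & 1 == 0:
    rank += 1

print(j, rank)  # → 11 9
```

The hash contributes twice: its leading bits route it to a bucket, and its remaining bits feed that bucket's rank estimate.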

SuperLogLog Estimator (2003)

Enhanced LogLog with outlier elimination to improve robustness.

import math

def SuperLogLog(hashes, k):
    m = 2 ** k
    buckets = [0] * m
    # Bucket processing is identical to LogLog
    for hash_val in hashes:
        j = int(hash_val[:k], 2)
        rank = 1
        for c in reversed(hash_val[k:]):
            if c == "0":
                rank += 1
            else:
                break
        buckets[j] = max(buckets[j], rank)
    # Truncation rule: keep only the smallest ~70% of bucket values
    keep = math.floor(0.7 * m)
    buckets = sorted(buckets)[:keep]
    # Estimate from the remaining buckets
    average = sum(buckets) / keep
    ALPHA = 0.764           # empirical bias correction for 70% truncation
    return ALPHA * m * 2 ** average

Improvement: More robust against hash collisions and extreme values

HyperLogLog Estimator (2007)

The breakthrough algorithm using harmonic mean for near-optimal accuracy.

import math

def HyperLogLog(hashes, k):
    m = 2 ** k
    buckets = [0] * m
    # Bucket processing is identical to LogLog / SuperLogLog
    for hash_val in hashes:
        j = int(hash_val[:k], 2)
        rank = 1
        for c in reversed(hash_val[k:]):
            if c == "0":
                rank += 1
            else:
                break
        buckets[j] = max(buckets[j], rank)
    # Key innovation: harmonic mean of the per-bucket estimates 2**b
    estimate = m ** 2 / sum(2 ** -b for b in buckets)
    # Bias correction alpha_m depends on the number of buckets
    if k <= 4:
        alpha = 0.673
    elif k == 5:
        alpha = 0.697
    elif k == 6:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / m)
    estimate *= alpha
    # Small-range correction: fall back to linear counting
    if estimate < 2.5 * m:
        zeros = buckets.count(0)
        if zeros != 0:
            estimate = m * math.log(m / zeros)
    # Large-range correction for 32-bit hash saturation
    elif estimate > (2 ** 32) / 30:
        estimate = -(2 ** 32) * math.log(1 - estimate / 2 ** 32)
    return estimate

Revolutionary Features:

  • Harmonic mean reduces bias significantly
  • Sophisticated bias correction formulas
  • Small and large range corrections
  • Near-optimal accuracy with minimal memory
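
The effect of the harmonic mean can be demonstrated with a hand-picked toy register array, where the value 24 models a bucket hit by one extreme hash:

```python
buckets = [5, 6, 5, 24]   # register values; 24 is an outlier

# LogLog combines registers via 2**(arithmetic mean of ranks), i.e. a
# geometric mean of the per-bucket estimates -- the outlier drags it upward:
loglog_style = 2 ** (sum(buckets) / len(buckets))         # 2**10 = 1024.0

# HyperLogLog's harmonic mean of 2**b barely notices the outlier:
hll_style = len(buckets) / sum(2 ** -b for b in buckets)  # ≈ 51.2
```

The harmonic mean is dominated by the smallest terms, so one inflated register cannot blow up the estimate — which is why the explicit truncation step of SuperLogLog becomes unnecessary.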

Why "LogLog"?

The name comes from the memory requirement: to count cardinalities up to Nmax, these algorithms need approximately log(log(Nmax)) bits. This means cardinalities in the billions can be estimated using just 1-2 kilobytes of memory!
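
As a back-of-the-envelope check (the register count m = 2**11 is an arbitrary illustrative choice, not a recommendation):

```python
import math

n_max = 10 ** 9    # largest cardinality we want to count
# Each register stores a rank up to ~log2(n_max), which itself needs
# only about log2(log2(n_max)) bits:
bits_per_register = math.ceil(math.log2(math.log2(n_max)))
m = 2 ** 11        # 2048 registers
total_kib = m * bits_per_register / 8 / 1024
print(bits_per_register, total_kib)  # → 5 1.25
```

Five bits per register suffice for a billion distinct elements, so 2048 registers fit in 1.25 KiB — squarely in the "1-2 kilobytes" range quoted above.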

Performance Comparison

Algorithm       | Year | Memory       | Accuracy      | Key Innovation
----------------|------|--------------|---------------|-----------------------------------
Flajolet-Martin | 1985 | 32 bits      | High variance | First probabilistic counting
LogLog          | 2003 | O(log log N) | Good          | Bucketing for variance reduction
SuperLogLog     | 2003 | O(log log N) | Better        | Outlier elimination
HyperLogLog     | 2007 | O(log log N) | Near-optimal  | Harmonic mean + range corrections

Real-World Impact

HyperLogLog has become the industry standard for cardinality estimation:

  • Google: BigQuery, Analytics
  • Facebook: Audience insights, ad targeting
  • Amazon: Redshift, DynamoDB
  • Apache: Spark, Druid, Pinot
  • Redis: PFADD, PFCOUNT commands

Why HyperLogLog Won

  • Accuracy: Standard error of ~1.04/√m
  • Memory: Fixed size regardless of cardinality
  • Mergeable: Sketches can be combined
  • Parallelizable: Easy to distribute computation
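
Mergeability follows directly from the register semantics: each register holds the maximum rank observed, and taking maxima commutes with taking the union of streams. A minimal sketch of the merge, with hand-picked register arrays:

```python
a = [3, 0, 7, 2]   # registers of a sketch built over stream A
b = [1, 5, 4, 2]   # registers of a sketch built over stream B

# Element-wise max yields exactly the sketch of the union of A and B
merged = [max(x, y) for x, y in zip(a, b)]
print(merged)  # → [3, 5, 7, 2]
```

This is why each node in a cluster can sketch its own shard and ship only a few kilobytes of registers to a coordinator for the global count.
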
Next: HyperLogLog Deep Dive →