HyperLogLog Sketches

Deep Dive into the Industry Standard Algorithm

CardinalityKit Documentation

HyperLogLog Overview

HyperLogLog (HLL) is the industry standard for cardinality estimation, providing near-optimal accuracy with minimal memory usage. Its use of the harmonic mean of bucket values gives it better accuracy than earlier algorithms such as LogLog and SuperLogLog at the same memory budget.

🎯 Key Innovation: Harmonic Mean

While LogLog uses arithmetic mean and SuperLogLog uses truncated mean, HyperLogLog uses the harmonic mean of bucket values. This simple change dramatically improves accuracy and reduces bias.

Complete HyperLogLog Implementation

```python
import math, sys

def HyperLogLog(hashes, k):
    """
    HyperLogLog cardinality estimation

    Args:
        hashes: List of binary hash strings
        k: Number of bits for bucket indexing (2^k buckets)

    Returns:
        estimate: Cardinality estimate
        memory_used: Memory usage in bytes
    """
    # Number of buckets
    m = 2 ** k

    # Initialize buckets to 0
    buckets = [0] * m

    # Process each hash
    for hash_val in hashes:
        # Get bucket index (first k bits)
        j = int(hash_val[:k], 2)

        # Get remaining bits for rank calculation
        data = hash_val[k:]

        # Calculate rank (number of trailing zeros + 1)
        rank = 1
        for c in reversed(data):
            if c == "0":
                rank += 1
            else:
                break

        # Keep maximum rank per bucket
        buckets[j] = max(buckets[j], rank)

    # Harmonic-mean term: inverse of the sum of 2^(-M[j])
    total = 0
    for bucket in buckets:
        total += 2 ** (-1 * bucket)
    harmonic_mean = total ** -1

    # Raw estimate
    estimate = (m ** 2) * harmonic_mean

    # Bias correction
    if k <= 4:
        BIAS = 0.673
    elif k == 5:
        BIAS = 0.697
    else:
        BIAS = 0.7213 / (1 + (1.079 / m))
    estimate = BIAS * estimate

    # Small range correction
    if estimate < ((5 / 2) * m):
        zeros = sum(1 for bucket in buckets if bucket == 0)
        if zeros != 0:
            estimate = m * math.log(m / zeros)
    # Large range correction
    elif estimate > ((2 ** 32) / 30):
        estimate = -1 * (2 ** 32) * math.log(1 - (estimate / (2 ** 32)))

    return estimate, sys.getsizeof(buckets)
```

Step-by-Step Algorithm Walkthrough

Example: Processing Hash "11000101..."

  1. Bucket Assignment (k=3): First 3 bits "110" → bucket 6
  2. Rank Calculation: Remaining bits "00101..." → count trailing zeros
  3. Trailing Zeros: "...101" has 0 trailing zeros → rank = 1
  4. Update Bucket: buckets[6] = max(buckets[6], 1)
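The four steps above can be reproduced in a few lines of Python (a standalone sketch mirroring the walkthrough, not CardinalityKit code; the hash is truncated here to its first 8 bits):

```python
# Walkthrough of the example hash "11000101" with k = 3
hash_val = "11000101"
k = 3

# Step 1: bucket assignment -- first k bits "110" select the bucket
j = int(hash_val[:k], 2)

# Steps 2-3: rank = number of trailing zeros in the remaining bits, plus 1
data = hash_val[k:]          # "00101"
rank = 1
for c in reversed(data):     # scan from the right
    if c == "0":
        rank += 1
    else:
        break                # first 1-bit from the right ends the count

print(j, rank)  # bucket 6, rank 1
```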

Bucket State Example (k=3, 8 buckets):

Bucket:    0   1   2   3   4   5   6   7
Max rank:  6   1   0  11   8   5   7   9

Harmonic Mean Calculation:

Harmonic mean = 8 / (2^-6 + 2^-1 + 2^0 + 2^-11 + 2^-8 + 2^-5 + 2^-7 + 2^-9) ≈ 8 / 1.561 ≈ 5.12
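Plugging the example bucket state into the formula gives a quick sanity check (a standalone snippet, assuming the eight values above):

```python
# Harmonic mean over the example buckets [6, 1, 0, 11, 8, 5, 7, 9]
buckets = [6, 1, 0, 11, 8, 5, 7, 9]
m = len(buckets)

total = sum(2 ** -b for b in buckets)   # sum of the 2^(-M[j]) terms
harmonic_mean = m / total               # 8 / 1.561... ≈ 5.12

print(round(harmonic_mean, 2))
```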

Mathematical Foundation

HyperLogLog Formula:

E = αm × m² × (Σ 2^(-M[j]))^(-1)

where:

  • αm = bias correction factor
  • m = number of buckets (2^k)
  • M[j] = maximum rank in bucket j

🔬 Why Harmonic Mean Works

The harmonic mean is more sensitive to small values than arithmetic mean. In cardinality estimation, small bucket values (indicating fewer unique elements) should have more influence on the final estimate. This mathematical property makes HLL more accurate than its predecessors.
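The sensitivity difference is easy to see on toy numbers (an illustration only; in HLL the harmonic mean is effectively taken over the 2^M[j] register values):

```python
# One large outlier among small values: the harmonic mean barely moves,
# while the arithmetic mean is dragged upward
values = [1, 1, 1, 16]

arithmetic = sum(values) / len(values)               # (1+1+1+16)/4 = 4.75
harmonic = len(values) / sum(1 / v for v in values)  # 4 / 3.0625 ≈ 1.31

print(arithmetic, round(harmonic, 2))
```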

Bias Correction

HyperLogLog applies sophisticated bias correction based on the number of buckets:

```python
def calculate_bias_correction(k, m):
    """Calculate bias correction factor for HyperLogLog"""
    if k <= 4:
        # Pre-calculated value for small k
        return 0.673
    elif k == 5:
        return 0.697
    else:
        # Formula for k >= 6
        return 0.7213 / (1 + (1.079 / m))

# Example bias factors:
# k=4  (16 buckets):    α = 0.673
# k=8  (256 buckets):   α ≈ 0.7183
# k=12 (4096 buckets):  α ≈ 0.7211
# k=16 (65536 buckets): α ≈ 0.7213
```

Bias Correction Impact:

  • Without correction: Systematic overestimation of ~39%
  • With correction: Unbiased estimates within statistical bounds
  • Accuracy improvement: From ~40% bias to <2% standard error (for k ≥ 12)
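The ~39% figure falls straight out of the correction factor: the raw estimate is inflated by roughly 1/α. A quick check, using the same α definition as the code above:

```python
def alpha(k, m):
    """Bias correction factor, as defined in the implementation above."""
    if k <= 4:
        return 0.673
    elif k == 5:
        return 0.697
    return 0.7213 / (1 + 1.079 / m)

# Raw (uncorrected) estimates overshoot by roughly 1/alpha - 1
k = 12
m = 2 ** k
overshoot_pct = (1 / alpha(k, m) - 1) * 100

print(round(overshoot_pct, 1))  # roughly 39% overestimation
```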

Range Corrections

```python
import math

def apply_range_corrections(estimate, m, buckets):
    """Apply small and large range corrections"""
    # Small range correction (estimate < 2.5m)
    if estimate < ((5 / 2) * m):
        zeros = sum(1 for bucket in buckets if bucket == 0)
        if zeros != 0:
            # Use linear counting for small ranges
            estimate = m * math.log(m / zeros)
            print(f"Applied small range correction: {estimate}")
    # Large range correction (estimate > 2^32/30)
    elif estimate > ((2 ** 32) / 30):
        # Correct for hash collision effects
        estimate = -1 * (2 ** 32) * math.log(1 - (estimate / (2 ** 32)))
        print(f"Applied large range correction: {estimate}")
    return estimate
```

🎯 When Range Corrections Apply

  • Small Range: When many buckets are empty (estimate < 2.5m)
  • Large Range: When hash collisions become significant (estimate > 143M)
  • Normal Range: Standard HLL formula works optimally
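A small helper makes the regime boundaries concrete (a sketch with illustrative names; the thresholds are the ones from the code above):

```python
def correction_regime(estimate, m):
    """Return which correction regime a raw estimate falls into."""
    if estimate < 2.5 * m:
        return "small-range"   # linear counting territory
    elif estimate > (2 ** 32) / 30:
        return "large-range"   # hash-collision correction territory
    return "normal"

m = 2 ** 14  # 16384 buckets -> small-range threshold at 2.5 * m = 40960

print(correction_regime(1_000, m))        # small-range
print(correction_regime(1_000_000, m))    # normal
print(correction_regime(200_000_000, m))  # large-range (above ~143.2M)
```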

Memory Usage Analysis

```python
import math

def analyze_memory_usage():
    """Analyze HyperLogLog memory requirements"""
    memory_analysis = {}
    for k in range(4, 20):
        m = 2 ** k
        # Each bucket stores one small integer (rank value);
        # a CPython int object typically occupies 28 bytes
        memory_bytes = m * 28
        memory_kb = memory_bytes / 1024
        memory_mb = memory_kb / 1024
        # Theoretical accuracy (standard error)
        standard_error = 1.04 / math.sqrt(m)
        memory_analysis[k] = {
            'buckets': m,
            'memory_bytes': memory_bytes,
            'memory_kb': round(memory_kb, 2),
            'memory_mb': round(memory_mb, 2),
            'standard_error': round(standard_error * 100, 2)  # As percentage
        }
    return memory_analysis

# Example output:
# k=8:  256 buckets,   7KB,   ~6.5% error
# k=12: 4096 buckets,  112KB, ~1.63% error
# k=14: 16384 buckets, 448KB, ~0.81% error
# k=16: 65536 buckets, 1.75MB, ~0.41% error
```

Practical Implementation from CardinalityKit

```python
# From CardinalityKit's HyperLogLogEstimator.py
import math, sys

def HyperLogLog(hashes, k):
    m = 2 ** k
    buckets = [0] * m

    for i in range(0, len(hashes)):
        # Get bucket (first k bits)
        j = str(hashes[i])[:k]
        j = int(j, 2)

        # Get remaining bits
        data = hashes[i][k:]

        # Calculate rank (trailing zeros + 1)
        rank = 1
        for c in reversed(data):
            if c == "0":
                rank += 1
            else:
                break

        # Update bucket with maximum rank
        buckets[j] = max(buckets[j], rank)

    # Harmonic mean calculation
    total = 0
    for bucket in buckets:
        total += 2 ** (-1 * bucket)
    mean = total ** -1

    # Raw estimate
    estimate = (m ** 2) * mean

    # Bias correction
    if k <= 4:
        BIAS = 0.673
    elif k == 5:
        BIAS = 0.697
    else:
        BIAS = 0.7213 / (1 + (1.079 / m))
    estimate = BIAS * estimate

    # Small range correction
    if estimate < ((5 / 2) * m):
        zeros = 0
        for i in buckets:
            if buckets[i] == 0:  # Note: This is a bug in original code
                zeros += 1
        if not zeros == 0:
            estimate = m * math.log(estimate, 2)  # Should be log(m/zeros)
    # Large range correction
    elif estimate > ((2 ** 32) / 30):
        estimate = -1 * (2 ** 32) * math.log(1 - (estimate / (2 ** 32)))

    return estimate, sys.getsizeof(buckets)
```

⚠️ Implementation Notes

The original CardinalityKit implementation has two bugs in the small range correction: the zero-counting loop iterates over bucket values but then uses them as indices (`for i in buckets: if buckets[i] == 0`), and the linear-counting formula should use `math.log(m / zeros)` instead of `math.log(estimate, 2)`.
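One way the corrected step might look (a standalone sketch of the fix, not CardinalityKit's actual code):

```python
import math

def small_range_correction(estimate, buckets):
    """Linear-counting correction with both issues fixed."""
    m = len(buckets)
    if estimate < 2.5 * m:
        zeros = buckets.count(0)  # count empty buckets by value, not by index
        if zeros != 0:
            estimate = m * math.log(m / zeros)  # natural log of m / zeros
    return estimate
```

For example, with 14 of 16 buckets empty this yields 16 · ln(16/14) ≈ 2.14, replacing the unreliable raw estimate.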

Performance Characteristics

HyperLogLog Advantages:

  • Memory Efficient: O(log log n) bits per bucket
  • Fast Processing: O(1) per element insertion
  • Mergeable: Sketches can be combined with a bucket-wise MAX operation
  • Parallelizable: Easy to distribute across multiple machines
  • Accurate: Standard error of ~1.04/√m
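The merge property is worth spelling out: because each bucket stores a maximum, two sketches built with the same k combine losslessly via a bucket-wise max (a minimal sketch, assuming plain list-of-int bucket arrays as in the code above):

```python
def merge_sketches(buckets_a, buckets_b):
    """Merge two HLL bucket arrays built with the same k."""
    assert len(buckets_a) == len(buckets_b), "sketches must use the same k"
    return [max(a, b) for a, b in zip(buckets_a, buckets_b)]

# Two nodes each sketch a shard of the data, then the arrays are merged
node_a = [6, 1, 0, 11, 8, 5, 7, 9]
node_b = [2, 4, 3, 1, 9, 5, 0, 12]

print(merge_sketches(node_a, node_b))  # [6, 4, 3, 11, 9, 5, 7, 12]
```

The merged array is identical to what a single sketch over the union of both inputs would hold, which is why HLL distributes so well.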

Limitations:

  • Biased: Requires complex bias correction
  • Range Dependent: Needs different formulas for different ranges
  • Integer Arithmetic: Rank calculations can be imprecise

Industry Applications

🏢 Real-World Usage

  • Google BigQuery: APPROX_COUNT_DISTINCT function
  • Amazon Redshift: Approximate cardinality queries
  • Apache Spark: approxCountDistinct in DataFrames
  • Redis: PFADD/PFCOUNT commands
  • Presto/Trino: approx_distinct function
```sql
-- SQL examples using HyperLogLog

-- BigQuery
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

-- Redshift
SELECT APPROXIMATE COUNT(DISTINCT user_id) FROM events;

-- Spark SQL
SELECT approx_count_distinct(user_id) FROM events;

-- All use HyperLogLog under the hood!
```
Next: HyperReal Algorithm →