HyperLogLog Overview
HyperLogLog (HLL) is the industry standard for cardinality estimation, providing near-optimal accuracy with minimal memory usage. It uses the harmonic mean of bucket values to achieve superior performance compared to earlier algorithms.
🎯 Key Innovation: Harmonic Mean
While LogLog uses arithmetic mean and SuperLogLog uses truncated mean, HyperLogLog uses the harmonic mean of bucket values. This simple change dramatically improves accuracy and reduces bias.
Complete HyperLogLog Implementation
import math, sys

def HyperLogLog(hashes, k):
    """
    HyperLogLog cardinality estimation

    Args:
        hashes: List of binary hash strings
        k: Number of bits for bucket indexing (2^k buckets)

    Returns:
        estimate: Cardinality estimate
        memory_used: Memory usage in bytes
    """
    # Number of buckets
    m = 2 ** k
    # Initialize buckets to 0
    buckets = [0] * m

    # Process each hash
    for hash_val in hashes:
        # Get bucket index (first k bits)
        j = int(hash_val[:k], 2)
        # Get remaining bits for rank calculation
        data = hash_val[k:]
        # Calculate rank (number of trailing zeros + 1)
        rank = 1
        for c in reversed(data):
            if c == "0":
                rank += 1
            else:
                break
        # Keep maximum rank per bucket
        buckets[j] = max(buckets[j], rank)

    # Harmonic-mean indicator: inverse of the sum of 2^-bucket
    total = 0
    for bucket in buckets:
        total += 2 ** (-1 * bucket)
    harmonic_mean = total ** -1

    # Raw estimate
    estimate = (m ** 2) * harmonic_mean

    # Bias correction
    if k <= 4:
        BIAS = 0.673
    elif k == 5:
        BIAS = 0.697
    else:
        BIAS = 0.7213 / (1 + (1.079 / m))
    estimate = BIAS * estimate

    # Small range correction: fall back to linear counting
    if estimate < ((5 / 2) * m):
        zeros = sum(1 for bucket in buckets if bucket == 0)
        if zeros != 0:
            estimate = m * math.log(m / zeros)
    # Large range correction for 32-bit hash saturation
    elif estimate > ((2 ** 32) / 30):
        estimate = -1 * (2 ** 32) * math.log(1 - (estimate / (2 ** 32)))

    return estimate, sys.getsizeof(buckets)
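The function above expects fixed-width binary hash strings. A minimal sketch of how such input could be produced (the `to_bits` helper and the SHA-1/32-bit choices here are illustrative assumptions, not part of the original code):

```python
import hashlib

def to_bits(item, width=32):
    # Illustrative helper: hash an item and format 32 bits of the
    # digest as a fixed-width binary string, the input format
    # HyperLogLog(hashes, k) expects.
    digest = hashlib.sha1(str(item).encode()).digest()
    value = int.from_bytes(digest[:4], "big")
    return format(value, f"0{width}b")

hashes = [to_bits(f"user-{i}") for i in range(10_000)]
print(len(hashes[0]), set(hashes[0]) <= {"0", "1"})  # 32 True
```

Any well-mixed hash works here; the only requirement is that the bit strings are uniform and of constant width.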
Step-by-Step Algorithm Walkthrough
Example: Processing Hash "11000101..."
- Bucket Assignment (k=3): First 3 bits "110" → bucket 6
- Rank Calculation: Remaining bits "00101..." are scanned from the right for trailing zeros
- Trailing Zeros: the suffix ends in "...101", giving 0 trailing zeros → rank = 1
- Update Bucket: buckets[6] = max(buckets[6], 1)
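The two steps above can be sketched as a small helper (hypothetical name, shown here on the 8-bit prefix of the example hash with k=3):

```python
def bucket_and_rank(bits, k):
    # Split a binary hash string into a bucket index (first k bits)
    # and a rank (trailing zeros of the remainder, plus one).
    j = int(bits[:k], 2)
    rest = bits[k:]
    rank = len(rest) - len(rest.rstrip("0")) + 1
    return j, rank

print(bucket_and_rank("11000101", 3))  # → (6, 1)
```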
Bucket State Example (k=3, 8 buckets):
buckets = [6, 1, 0, 11, 8, 5, 7, 9]
Harmonic Mean Calculation:
1/((2^-6 + 2^-1 + 2^0 + 2^-11 + 2^-8 + 2^-5 + 2^-7 + 2^-9) / 8)
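Reading the bucket values off the exponents in that expression, the calculation can be checked numerically (a quick sanity sketch):

```python
buckets = [6, 1, 0, 11, 8, 5, 7, 9]
m = len(buckets)
total = sum(2.0 ** -b for b in buckets)
harmonic_mean = m / total     # the 1/((...) / 8) expression above
raw_estimate = m * m / total  # what the m**2 * (total**-1) step computes
print(round(harmonic_mean, 3), round(raw_estimate, 1))
```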
Mathematical Foundation
🔬 Why Harmonic Mean Works
The harmonic mean is more sensitive to small values than arithmetic mean. In cardinality estimation, small bucket values (indicating fewer unique elements) should have more influence on the final estimate. This mathematical property makes HLL more accurate than its predecessors.
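The sensitivity difference is easy to see on a toy list with one large outlier (illustrative numbers only):

```python
from statistics import harmonic_mean, mean

values = [1, 1, 2, 100]  # one outlier dominates the arithmetic mean
print(mean(values), round(harmonic_mean(values), 2))
```

The arithmetic mean lands at 26, pulled up by the outlier, while the harmonic mean stays near the small values (about 1.59); in HLL terms, one bucket that happened to see a long run of zeros cannot blow up the estimate.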
Bias Correction
HyperLogLog applies sophisticated bias correction based on the number of buckets:
def calculate_bias_correction(k, m):
    """Calculate bias correction factor for HyperLogLog"""
    if k <= 4:
        # Pre-calculated value for small k
        return 0.673
    elif k == 5:
        return 0.697
    else:
        # Formula for k >= 6
        return 0.7213 / (1 + (1.079 / m))
# Example bias factors:
# k=4 (16 buckets): α = 0.673
# k=8 (256 buckets): α ≈ 0.7183
# k=12 (4096 buckets): α ≈ 0.7211
# k=16 (65536 buckets): α ≈ 0.7213
Bias Correction Impact:
- Without correction: Systematic overestimation of ~39%
- With correction: Unbiased estimates within statistical bounds
- Accuracy improvement: From ~40% error to <2% error
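The ~39% figure follows directly from the limiting bias constant α ≈ 0.7213 (a quick check, not taken from the original text):

```python
alpha_inf = 0.7213  # limiting value of the correction factor
overestimate = (1 / alpha_inf - 1) * 100
print(round(overestimate, 1))  # ≈ 38.6: the raw estimate runs high by this much
```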
Range Corrections
import math

def apply_range_corrections(estimate, m, buckets):
    """Apply small and large range corrections"""
    # Small range correction (estimate < 2.5m)
    if estimate < ((5 / 2) * m):
        zeros = sum(1 for bucket in buckets if bucket == 0)
        if zeros != 0:
            # Use linear counting for small ranges
            estimate = m * math.log(m / zeros)
            print(f"Applied small range correction: {estimate}")
    # Large range correction (estimate > 2^32/30)
    elif estimate > ((2 ** 32) / 30):
        # Correct for hash collision effects
        estimate = -1 * (2 ** 32) * math.log(1 - (estimate / (2 ** 32)))
        print(f"Applied large range correction: {estimate}")
    return estimate
🎯 When Range Corrections Apply
- Small Range: When many buckets are empty (estimate < 2.5m)
- Large Range: When hash collisions become significant (estimate > 143M)
- Normal Range: Standard HLL formula works optimally
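For concreteness, the two thresholds can be computed for a typical configuration (k=12 is an illustrative choice):

```python
k = 12
m = 2 ** k
small_threshold = 2.5 * m         # below this, linear counting takes over
large_threshold = (2 ** 32) / 30  # above this, the large range fix applies
print(small_threshold, round(large_threshold))  # 10240.0 143165577
```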
Memory Usage Analysis
import math

def analyze_memory_usage():
    """Analyze HyperLogLog memory requirements"""
    memory_analysis = {}
    for k in range(4, 20):
        m = 2 ** k
        # Each bucket stores one integer (rank value);
        # a small Python int typically uses 28 bytes
        memory_bytes = m * 28
        memory_kb = memory_bytes / 1024
        memory_mb = memory_kb / 1024
        # Theoretical accuracy (standard error)
        standard_error = 1.04 / math.sqrt(m)
        memory_analysis[k] = {
            'buckets': m,
            'memory_bytes': memory_bytes,
            'memory_kb': round(memory_kb, 2),
            'memory_mb': round(memory_mb, 2),
            'standard_error': round(standard_error * 100, 2)  # As percentage
        }
    return memory_analysis
# Example output:
# k=8: 256 buckets, 7KB, 6.5% error
# k=12: 4096 buckets, 112KB, 1.63% error
# k=14: 16384 buckets, 448KB, 0.81% error
# k=16: 65536 buckets, 1.75MB, 0.41% error
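One row of that output can be reproduced by hand (using the same 28-bytes-per-int assumption as the function above):

```python
import math

k = 12
m = 2 ** k
memory_kb = m * 28 / 1024
standard_error = 1.04 / math.sqrt(m)
print(memory_kb, round(standard_error * 100, 2))  # 112.0 KB, ≈1.63% error
```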
Practical Implementation from CardinalityKit
# From CardinalityKit's HyperLogLogEstimator.py
import math, sys

def HyperLogLog(hashes, k):
    m = 2 ** k
    buckets = [0] * m
    for i in range(0, len(hashes)):
        # Get bucket (first k bits)
        j = str(hashes[i])[:k]
        j = int(j, 2)
        # Get remaining bits
        data = hashes[i][k:]
        # Calculate rank (trailing zeros + 1)
        rank = 1
        for c in reversed(data):
            if c == "0":
                rank += 1
            else:
                break
        # Update bucket with maximum rank
        buckets[j] = max(buckets[j], rank)
    # Harmonic mean calculation
    total = 0
    for bucket in buckets:
        total += 2 ** (-1 * bucket)
    mean = total ** -1
    # Raw estimate
    estimate = (m ** 2) * mean
    # Bias correction
    if k <= 4:
        BIAS = 0.673
    elif k == 5:
        BIAS = 0.697
    else:
        BIAS = 0.7213 / (1 + (1.079 / m))
    estimate = BIAS * estimate
    # Small range correction
    if estimate < ((5 / 2) * m):
        zeros = 0
        for i in buckets:
            if buckets[i] == 0:  # Note: this is a bug in the original code
                zeros += 1
        if not zeros == 0:
            estimate = m * math.log(estimate, 2)  # Should be log(m/zeros)
    # Large range correction
    elif estimate > ((2 ** 32) / 30):
        estimate = -1 * (2 ** 32) * math.log(1 - (estimate / (2 ** 32)))
    return estimate, sys.getsizeof(buckets)
⚠️ Implementation Notes
The original CardinalityKit implementation has two small bugs in the small range correction: it iterates over bucket values as if they were indices when counting empty buckets, and it computes log(estimate, 2) where linear counting requires log(m / zeros).
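A corrected version of that branch might look like the following sketch (assuming buckets is the list of rank registers):

```python
import math

def small_range_correction(estimate, m, buckets):
    # Linear counting fallback: when the estimate is below 2.5 * m and
    # some buckets are still empty, count the empties directly.
    if estimate < 2.5 * m:
        zeros = buckets.count(0)  # count values, not indices
        if zeros != 0:
            return m * math.log(m / zeros)
    return estimate
```

Called as small_range_correction(10, 8, [0, 0, 0, 0, 1, 2, 0, 0]), this returns 8 * ln(8/6) ≈ 2.3; estimates above 2.5 * m pass through unchanged.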
Performance Characteristics
HyperLogLog Advantages:
- Memory Efficient: O(log log n) space complexity
- Fast Processing: O(1) per element insertion
- Mergeable: Sketches can be combined with MAX operation
- Parallelizable: Easy to distribute across multiple machines
- Accurate: Standard error of ~1.04/√m
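The mergeability property can be sketched in a few lines (hypothetical helper; both sketches must have been built with the same k):

```python
def merge(buckets_a, buckets_b):
    # Element-wise maximum of two register arrays yields the same
    # sketch as processing the union of both input streams.
    return [max(a, b) for a, b in zip(buckets_a, buckets_b)]

print(merge([0, 3, 1, 2], [1, 2, 4, 2]))  # → [1, 3, 4, 2]
```

This is why HLL distributes so well: each machine sketches its own shard, and only the tiny register arrays are shipped and merged.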
Limitations:
- Biased: Requires complex bias correction
- Range Dependent: Needs different formulas for different ranges
- Hash Width Dependent: Rank is bounded by the hash length, so 32-bit hashes cap the countable range
Industry Applications
🏢 Real-World Usage
- Google BigQuery: APPROX_COUNT_DISTINCT function
- Amazon Redshift: Approximate cardinality queries
- Apache Spark: approxCountDistinct in DataFrames
- Redis: PFADD/PFCOUNT commands
- Presto/Trino: approx_distinct function
-- SQL examples using HyperLogLog
-- BigQuery
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;
-- Redshift
SELECT APPROXIMATE COUNT(DISTINCT user_id) FROM events;
-- Spark SQL
SELECT approx_count_distinct(user_id) FROM events;
-- All use HyperLogLog under the hood!