HyperLogLog Sketches

Deep Dive into the Industry Standard Algorithm

CardinalityKit Documentation

HyperLogLog Overview

HyperLogLog (HLL) is the industry standard for cardinality estimation, providing near-optimal accuracy with minimal memory usage. Its use of the harmonic mean of bucket values gives it better accuracy than earlier algorithms such as LogLog and SuperLogLog at the same memory budget.

🎯 Key Innovation: Harmonic Mean

While LogLog uses arithmetic mean and SuperLogLog uses truncated mean, HyperLogLog uses the harmonic mean of bucket values. This simple change dramatically improves accuracy and reduces bias.

Complete HyperLogLog Implementation

```python
import math, sys

def HyperLogLog(hashes, k):
    """
    HyperLogLog cardinality estimation

    Args:
        hashes: List of binary hash strings
        k: Number of bits for bucket indexing (2^k buckets)

    Returns:
        estimate: Cardinality estimate
        memory_used: Memory usage in bytes
    """
    # Number of buckets
    m = 2 ** k

    # Initialize buckets to 0
    buckets = [0] * m

    # Process each hash
    for hash_val in hashes:
        # Get bucket index (first k bits)
        j = int(hash_val[:k], 2)

        # Get remaining bits for rank calculation
        data = hash_val[k:]

        # Calculate rank (number of trailing zeros + 1)
        rank = 1
        for c in reversed(data):
            if c == "0":
                rank += 1
            else:
                break

        # Keep maximum rank per bucket
        buckets[j] = max(buckets[j], rank)

    # Harmonic-mean term: inverse of the sum of 2^(-M[j])
    total = 0
    for bucket in buckets:
        total += 2 ** (-1 * bucket)
    harmonic_mean = total ** -1

    # Raw estimate
    estimate = (m ** 2) * harmonic_mean

    # Bias correction
    if k <= 4:
        BIAS = 0.673
    elif k == 5:
        BIAS = 0.697
    else:
        BIAS = 0.7213 / (1 + (1.079 / m))
    estimate = BIAS * estimate

    # Small range correction
    if estimate < ((5 / 2) * m):
        zeros = sum(1 for bucket in buckets if bucket == 0)
        if zeros != 0:
            estimate = m * math.log(m / zeros)
    # Large range correction
    elif estimate > ((2 ** 32) / 30):
        estimate = -1 * (2 ** 32) * math.log(1 - (estimate / (2 ** 32)))

    return estimate, sys.getsizeof(buckets)
```

Step-by-Step Algorithm Walkthrough

Example: Processing Hash "11000101..."

  1. Bucket Assignment (k=3): First 3 bits "110" → bucket 6
  2. Rank Calculation: Remaining bits "00101..." → count trailing zeros
  3. Trailing Zeros: "...101" has 0 trailing zeros → rank = 1
  4. Update Bucket: buckets[6] = max(buckets[6], 1)
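The four steps above can be reproduced in a few lines of Python (a standalone sketch mirroring the walkthrough, not CardinalityKit code; the hash is truncated here to its first 8 bits):

```python
# Walkthrough of the example hash "11000101" with k = 3
hash_val = "11000101"
k = 3

# Step 1: bucket assignment -- first k bits "110" select the bucket
j = int(hash_val[:k], 2)

# Steps 2-3: rank = number of trailing zeros in the remaining bits, plus 1
data = hash_val[k:]          # "00101"
rank = 1
for c in reversed(data):     # scan from the right
    if c == "0":
        rank += 1
    else:
        break                # first 1-bit from the right ends the count

print(j, rank)  # bucket 6, rank 1
```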

Bucket State Example (k=3, 8 buckets):

Bucket:    0   1   2   3   4   5   6   7
Max rank:  6   1   0  11   8   5   7   9

Harmonic Mean Calculation:

Harmonic mean = 8 / (2^-6 + 2^-1 + 2^0 + 2^-11 + 2^-8 + 2^-5 + 2^-7 + 2^-9) ≈ 8 / 1.561 ≈ 5.12
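Plugging the example bucket state into the formula gives a quick sanity check (a standalone snippet, assuming the eight values above):

```python
# Harmonic mean over the example buckets [6, 1, 0, 11, 8, 5, 7, 9]
buckets = [6, 1, 0, 11, 8, 5, 7, 9]
m = len(buckets)

total = sum(2 ** -b for b in buckets)   # sum of the 2^(-M[j]) terms
harmonic_mean = m / total               # 8 / 1.561... ≈ 5.12

print(round(harmonic_mean, 2))
```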

Mathematical Foundation

HyperLogLog Formula:

E = αm × m² × (Σ 2^(-M[j]))^(-1)

where:

  • αm = bias correction factor
  • m = number of buckets (2^k)
  • M[j] = maximum rank in bucket j

🔬 Why Harmonic Mean Works

The harmonic mean is more sensitive to small values than arithmetic mean. In cardinality estimation, small bucket values (indicating fewer unique elements) should have more influence on the final estimate. This mathematical property makes HLL more accurate than its predecessors.
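The sensitivity difference is easy to see on toy numbers (an illustration only; in HLL the harmonic mean is effectively taken over the 2^M[j] register values):

```python
# One large outlier among small values: the harmonic mean barely moves,
# while the arithmetic mean is dragged upward
values = [1, 1, 1, 16]

arithmetic = sum(values) / len(values)               # (1+1+1+16)/4 = 4.75
harmonic = len(values) / sum(1 / v for v in values)  # 4 / 3.0625 ≈ 1.31

print(arithmetic, round(harmonic, 2))
```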

Bias Correction

HyperLogLog applies sophisticated bias correction based on the number of buckets:

```python
def calculate_bias_correction(k, m):
    """Calculate bias correction factor for HyperLogLog"""
    if k <= 4:
        # Pre-calculated value for small k
        return 0.673
    elif k == 5:
        return 0.697
    else:
        # Formula for k >= 6
        return 0.7213 / (1 + (1.079 / m))

# Example bias factors:
# k=4  (16 buckets):    α = 0.673
# k=8  (256 buckets):   α ≈ 0.7183
# k=12 (4096 buckets):  α ≈ 0.7211
# k=16 (65536 buckets): α ≈ 0.7213
```

Bias Correction Impact:

  • Without correction: Systematic overestimation of ~39%
  • With correction: Unbiased estimates within statistical bounds
  • Accuracy improvement: From ~40% bias to <2% standard error (for k ≥ 12)
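The ~39% figure falls straight out of the correction factor: the raw estimate is inflated by roughly 1/α. A quick check, using the same α definition as the code above:

```python
def alpha(k, m):
    """Bias correction factor, as defined in the implementation above."""
    if k <= 4:
        return 0.673
    elif k == 5:
        return 0.697
    return 0.7213 / (1 + 1.079 / m)

# Raw (uncorrected) estimates overshoot by roughly 1/alpha - 1
k = 12
m = 2 ** k
overshoot_pct = (1 / alpha(k, m) - 1) * 100

print(round(overshoot_pct, 1))  # roughly 39% overestimation
```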

Range Corrections

```python
import math

def apply_range_corrections(estimate, m, buckets):
    """Apply small and large range corrections"""
    # Small range correction (estimate < 2.5m)
    if estimate < ((5 / 2) * m):
        zeros = sum(1 for bucket in buckets if bucket == 0)
        if zeros != 0:
            # Use linear counting for small ranges
            estimate = m * math.log(m / zeros)
            print(f"Applied small range correction: {estimate}")
    # Large range correction (estimate > 2^32/30)
    elif estimate > ((2 ** 32) / 30):
        # Correct for hash collision effects
        estimate = -1 * (2 ** 32) * math.log(1 - (estimate / (2 ** 32)))
        print(f"Applied large range correction: {estimate}")
    return estimate
```

🎯 When Range Corrections Apply

  • Small Range: When many buckets are empty (estimate < 2.5m)
  • Large Range: When hash collisions become significant (estimate > 143M)
  • Normal Range: Standard HLL formula works optimally
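A small helper makes the regime boundaries concrete (a sketch with illustrative names; the thresholds are the ones from the code above):

```python
def correction_regime(estimate, m):
    """Return which correction regime a raw estimate falls into."""
    if estimate < 2.5 * m:
        return "small-range"   # linear counting territory
    elif estimate > (2 ** 32) / 30:
        return "large-range"   # hash-collision correction territory
    return "normal"

m = 2 ** 14  # 16384 buckets -> small-range threshold at 2.5 * m = 40960

print(correction_regime(1_000, m))        # small-range
print(correction_regime(1_000_000, m))    # normal
print(correction_regime(200_000_000, m))  # large-range (above ~143.2M)
```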

Memory Usage Analysis

```python
import math

def analyze_memory_usage():
    """Analyze HyperLogLog memory requirements"""
    memory_analysis = {}
    for k in range(4, 20):
        m = 2 ** k
        # Each bucket stores one small integer (rank value);
        # a CPython int object typically occupies 28 bytes
        memory_bytes = m * 28
        memory_kb = memory_bytes / 1024
        memory_mb = memory_kb / 1024
        # Theoretical accuracy (standard error)
        standard_error = 1.04 / math.sqrt(m)
        memory_analysis[k] = {
            'buckets': m,
            'memory_bytes': memory_bytes,
            'memory_kb': round(memory_kb, 2),
            'memory_mb': round(memory_mb, 2),
            'standard_error': round(standard_error * 100, 2)  # As percentage
        }
    return memory_analysis

# Example output:
# k=8:  256 buckets,   7KB,   ~6.5% error
# k=12: 4096 buckets,  112KB, ~1.63% error
# k=14: 16384 buckets, 448KB, ~0.81% error
# k=16: 65536 buckets, 1.75MB, ~0.41% error
```

Practical Implementation from CardinalityKit

```python
# From CardinalityKit's HyperLogLogEstimator.py
import math, sys

def HyperLogLog(hashes, k):
    m = 2 ** k
    buckets = [0] * m

    for i in range(0, len(hashes)):
        # Get bucket (first k bits)
        j = str(hashes[i])[:k]
        j = int(j, 2)

        # Get remaining bits
        data = hashes[i][k:]

        # Calculate rank (trailing zeros + 1)
        rank = 1
        for c in reversed(data):
            if c == "0":
                rank += 1
            else:
                break

        # Update bucket with maximum rank
        buckets[j] = max(buckets[j], rank)

    # Harmonic mean calculation
    total = 0
    for bucket in buckets:
        total += 2 ** (-1 * bucket)
    mean = total ** -1

    # Raw estimate
    estimate = (m ** 2) * mean

    # Bias correction
    if k <= 4:
        BIAS = 0.673
    elif k == 5:
        BIAS = 0.697
    else:
        BIAS = 0.7213 / (1 + (1.079 / m))
    estimate = BIAS * estimate

    # Small range correction
    if estimate < ((5 / 2) * m):
        zeros = 0
        for i in buckets:
            if buckets[i] == 0:  # Note: This is a bug in original code
                zeros += 1
        if not zeros == 0:
            estimate = m * math.log(estimate, 2)  # Should be log(m/zeros)
    # Large range correction
    elif estimate > ((2 ** 32) / 30):
        estimate = -1 * (2 ** 32) * math.log(1 - (estimate / (2 ** 32)))

    return estimate, sys.getsizeof(buckets)
```

⚠️ Implementation Notes

The original CardinalityKit implementation has two bugs in the small range correction: the zero-counting loop iterates over bucket values but then uses them as indices (`for i in buckets: if buckets[i] == 0`), and the linear-counting formula should use `math.log(m / zeros)` instead of `math.log(estimate, 2)`.
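One way the corrected step might look (a standalone sketch of the fix, not CardinalityKit's actual code):

```python
import math

def small_range_correction(estimate, buckets):
    """Linear-counting correction with both issues fixed."""
    m = len(buckets)
    if estimate < 2.5 * m:
        zeros = buckets.count(0)  # count empty buckets by value, not by index
        if zeros != 0:
            estimate = m * math.log(m / zeros)  # natural log of m / zeros
    return estimate
```

For example, with 14 of 16 buckets empty this yields 16 · ln(16/14) ≈ 2.14, replacing the unreliable raw estimate.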

Performance Characteristics

HyperLogLog Advantages:

  • Memory Efficient: O(log log n) bits per bucket
  • Fast Processing: O(1) per element insertion
  • Mergeable: Sketches can be combined with a bucket-wise MAX operation
  • Parallelizable: Easy to distribute across multiple machines
  • Accurate: Standard error of ~1.04/√m
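The merge property is worth spelling out: because each bucket stores a maximum, two sketches built with the same k combine losslessly via a bucket-wise max (a minimal sketch, assuming plain list-of-int bucket arrays as in the code above):

```python
def merge_sketches(buckets_a, buckets_b):
    """Merge two HLL bucket arrays built with the same k."""
    assert len(buckets_a) == len(buckets_b), "sketches must use the same k"
    return [max(a, b) for a, b in zip(buckets_a, buckets_b)]

# Two nodes each sketch a shard of the data, then the arrays are merged
node_a = [6, 1, 0, 11, 8, 5, 7, 9]
node_b = [2, 4, 3, 1, 9, 5, 0, 12]

print(merge_sketches(node_a, node_b))  # [6, 4, 3, 11, 9, 5, 7, 12]
```

The merged array is identical to what a single sketch over the union of both inputs would hold, which is why HLL distributes so well.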

Limitations:

  • Biased: Requires complex bias correction
  • Range Dependent: Needs different formulas for different ranges
  • Integer Arithmetic: Rank calculations can be imprecise

Industry Applications

🏢 Real-World Usage

  • Google BigQuery: APPROX_COUNT_DISTINCT function
  • Amazon Redshift: Approximate cardinality queries
  • Apache Spark: approxCountDistinct in DataFrames
  • Redis: PFADD/PFCOUNT commands
  • Presto/Trino: approx_distinct function
```sql
-- SQL examples using HyperLogLog

-- BigQuery
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

-- Redshift
SELECT APPROXIMATE COUNT(DISTINCT user_id) FROM events;

-- Spark SQL
SELECT approx_count_distinct(user_id) FROM events;

-- All use HyperLogLog under the hood!
```
Next: HyperReal Algorithm →