Algorithm Evolution

From Flajolet-Martin to HyperLogLog: 22 Years of Innovation

CardinalityKit Documentation
1985

Flajolet-Martin Estimator

Paper: "Probabilistic counting algorithms for data base applications"

Innovation: First probabilistic cardinality estimation using bit patterns

Key Insight: Use hash function rank (trailing zeros) to estimate cardinality

2003

LogLog Algorithm

Authors: Durand & Flajolet

Innovation: Introduced bucketing to reduce variance

Memory: O(log log N) bits per bucket for cardinalities up to N

Improvement: Better accuracy through averaging multiple estimates

2007

SuperLogLog

Enhancement: Added outlier elimination

Method: Discard the largest ~30% of bucket values before averaging

Benefit: Reduced impact of hash collisions and outliers

2003

HyperLogLog

Authors: Flajolet et al.

Innovation: Harmonic mean instead of arithmetic mean

Accuracy: Near-optimal cardinality estimation

Industry Standard: Widely adopted in big data systems

Philippe Flajolet (1948-2011)

Pioneer of analytic combinatorics and probabilistic algorithms. His work laid the foundation for modern cardinality estimation techniques used in big data analytics worldwide.

Flajolet-Martin Estimator (1985)

The foundational algorithm that started it all. Uses a single bitmap to record which hash ranks have been observed; the position of the first unset bit yields the estimate.

def FlajoletMartin(hashes):
    """Estimate cardinality from an iterable of fixed-width bit strings."""
    # Initialize 32-bit bitmap
    bitmap = [0] * 32
    for hash_val in hashes:
        # Rank = number of trailing zero bits in the hash
        rank = 0
        for c in reversed(hash_val):
            if c == "0":
                rank += 1
            else:
                break
        # Mark this rank as observed
        bitmap[rank] = 1
    # r = position of the first 0 in the bitmap
    r = 0
    for bit in bitmap:
        if bit == 1:
            r += 1
        else:
            break
    # Bias-corrected estimate: 2**r / phi
    PHI = 0.77351
    return 2 ** r / PHI

Limitations:

  • High variance in estimates
  • Relies on a single estimator, so one unlucky hash value skews the result
  • Limited accuracy for practical applications
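
The variance is easy to see empirically. The sketch below is a toy simulation (using Python's random module, and the simplified "keep only the maximum rank" variant rather than the full bitmap above) that estimates the same cardinality five times with fresh random hashes:

```python
import random

def trailing_zeros(x, width=32):
    # Rank of a hash value: number of trailing zero bits (width if x == 0)
    r = 0
    while r < width and (x >> r) & 1 == 0:
        r += 1
    return r

random.seed(42)
PHI = 0.77351          # Flajolet-Martin bias-correction constant
true_n = 10_000        # actual number of distinct elements
estimates = []
for _ in range(5):
    # One sketch: remember only the largest rank seen among true_n hashes
    best = max(trailing_zeros(random.getrandbits(32)) for _ in range(true_n))
    estimates.append(2 ** best / PHI)
```

Because each run can only produce values of the form 2**r / PHI, consecutive runs typically disagree by a factor of two or more — exactly the variance that bucketing later addresses.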

LogLog Estimator (2003)

Introduced bucketing to reduce variance by averaging multiple independent estimates.

def LogLog(hashes, k):
    m = 2 ** k              # number of buckets
    buckets = [0] * m
    for hash_val in hashes:  # hash_val is a fixed-width bit string
        # Bucket index = first k bits
        j = int(hash_val[:k], 2)
        # Rank = 1-based position of the rightmost 1 in the remaining bits
        rank = 1
        for c in reversed(hash_val[k:]):
            if c == "0":
                rank += 1
            else:
                break
        # Keep the maximum rank per bucket
        buckets[j] = max(buckets[j], rank)
    # Estimate from the arithmetic mean of the bucket ranks
    average = sum(buckets) / m
    ALPHA = 0.397011808     # asymptotic bias-correction constant
    return ALPHA * m * 2 ** average

Key Innovation: Averaging m independent bucket estimates cuts the standard error by a factor of √m
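
Concretely, the bucket split works like this (a toy 16-bit example rather than the 32-bit hashes assumed elsewhere, using integer bit operations instead of the string slicing in the code above):

```python
k = 4                       # first k bits select one of m = 2**k buckets
width = 16                  # total hash width for this toy example
h = 0b1011_0001_0000_0000   # a hypothetical 16-bit hash value

j = h >> (width - k)                  # bucket index from the leading 4 bits
rest = h & ((1 << (width - k)) - 1)   # the remaining 12 bits

# Rank = 1-based position of the rightmost 1-bit in the remainder
rank = 1
while rank <= width - k and (rest >> (rank - 1)) & 1 == 0:
    rank += 1

print(j, rank)  # → 11 9
```

The hash contributes twice: its leading bits route it to a bucket, and its remaining bits feed that bucket's rank estimate.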

SuperLogLog Estimator (2003)

Enhanced LogLog with outlier elimination to improve robustness.

import math

def SuperLogLog(hashes, k):
    m = 2 ** k
    buckets = [0] * m
    # Bucket processing is identical to LogLog
    for hash_val in hashes:
        j = int(hash_val[:k], 2)
        rank = 1
        for c in reversed(hash_val[k:]):
            if c == "0":
                rank += 1
            else:
                break
        buckets[j] = max(buckets[j], rank)
    # Truncation rule: keep only the smallest ~70% of bucket values
    keep = math.floor(0.7 * m)
    buckets = sorted(buckets)[:keep]
    # Estimate from the remaining buckets
    average = sum(buckets) / keep
    ALPHA = 0.764           # empirical bias correction for 70% truncation
    return ALPHA * m * 2 ** average

Improvement: More robust against hash collisions and extreme values

HyperLogLog Estimator (2007)

The breakthrough algorithm using harmonic mean for near-optimal accuracy.

import math

def HyperLogLog(hashes, k):
    m = 2 ** k
    buckets = [0] * m
    # Bucket processing is identical to LogLog / SuperLogLog
    for hash_val in hashes:
        j = int(hash_val[:k], 2)
        rank = 1
        for c in reversed(hash_val[k:]):
            if c == "0":
                rank += 1
            else:
                break
        buckets[j] = max(buckets[j], rank)
    # Key innovation: harmonic mean of the per-bucket estimates 2**b
    estimate = m ** 2 / sum(2 ** -b for b in buckets)
    # Bias correction alpha_m depends on the number of buckets
    if k <= 4:
        alpha = 0.673
    elif k == 5:
        alpha = 0.697
    elif k == 6:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / m)
    estimate *= alpha
    # Small-range correction: fall back to linear counting
    if estimate < 2.5 * m:
        zeros = buckets.count(0)
        if zeros != 0:
            estimate = m * math.log(m / zeros)
    # Large-range correction for 32-bit hash saturation
    elif estimate > (2 ** 32) / 30:
        estimate = -(2 ** 32) * math.log(1 - estimate / 2 ** 32)
    return estimate

Revolutionary Features:

  • Harmonic mean reduces bias significantly
  • Sophisticated bias correction formulas
  • Small and large range corrections
  • Near-optimal accuracy with minimal memory
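
The effect of the harmonic mean can be demonstrated with a hand-picked toy register array, where the value 24 models a bucket hit by one extreme hash:

```python
buckets = [5, 6, 5, 24]   # register values; 24 is an outlier

# LogLog combines registers via 2**(arithmetic mean of ranks), i.e. a
# geometric mean of the per-bucket estimates -- the outlier drags it upward:
loglog_style = 2 ** (sum(buckets) / len(buckets))         # 2**10 = 1024.0

# HyperLogLog's harmonic mean of 2**b barely notices the outlier:
hll_style = len(buckets) / sum(2 ** -b for b in buckets)  # ≈ 51.2
```

The harmonic mean is dominated by the smallest terms, so one inflated register cannot blow up the estimate — which is why the explicit truncation step of SuperLogLog becomes unnecessary.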

Why "LogLog"?

The name comes from the memory requirement: to count cardinalities up to Nmax, these algorithms need approximately log(log(Nmax)) bits. This means cardinalities in the billions can be estimated using just 1-2 kilobytes of memory!
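
As a back-of-the-envelope check (the register count m = 2**11 is an arbitrary illustrative choice, not a recommendation):

```python
import math

n_max = 10 ** 9    # largest cardinality we want to count
# Each register stores a rank up to ~log2(n_max), which itself needs
# only about log2(log2(n_max)) bits:
bits_per_register = math.ceil(math.log2(math.log2(n_max)))
m = 2 ** 11        # 2048 registers
total_kib = m * bits_per_register / 8 / 1024
print(bits_per_register, total_kib)  # → 5 1.25
```

Five bits per register suffice for a billion distinct elements, so 2048 registers fit in 1.25 KiB — squarely in the "1-2 kilobytes" range quoted above.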

Performance Comparison

Algorithm       | Year | Memory       | Accuracy      | Key Innovation
----------------|------|--------------|---------------|-----------------------------------
Flajolet-Martin | 1985 | 32 bits      | High variance | First probabilistic counting
LogLog          | 2003 | O(log log N) | Good          | Bucketing for variance reduction
SuperLogLog     | 2003 | O(log log N) | Better        | Outlier elimination
HyperLogLog     | 2007 | O(log log N) | Near-optimal  | Harmonic mean + range corrections

Real-World Impact

HyperLogLog has become the industry standard for cardinality estimation:

  • Google: BigQuery, Analytics
  • Facebook: Audience insights, ad targeting
  • Amazon: Redshift, DynamoDB
  • Apache: Spark, Druid, Pinot
  • Redis: PFADD, PFCOUNT commands

Why HyperLogLog Won

  • Accuracy: Standard error of ~1.04/√m
  • Memory: Fixed size regardless of cardinality
  • Mergeable: Sketches can be combined
  • Parallelizable: Easy to distribute computation
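
Mergeability follows directly from the register semantics: each register holds the maximum rank observed, and taking maxima commutes with taking the union of streams. A minimal sketch of the merge, with hand-picked register arrays:

```python
a = [3, 0, 7, 2]   # registers of a sketch built over stream A
b = [1, 5, 4, 2]   # registers of a sketch built over stream B

# Element-wise max yields exactly the sketch of the union of A and B
merged = [max(x, y) for x, y in zip(a, b)]
print(merged)  # → [3, 5, 7, 2]
```

This is why each node in a cluster can sketch its own shard and ship only a few kilobytes of registers to a coordinator for the global count.
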
Next: HyperLogLog Deep Dive →