What is HyperReal?
HyperReal (HR) is an innovative cardinality estimation algorithm that provides an unbiased alternative to HyperLogLog. While maintaining the same memory efficiency and privacy benefits, HyperReal eliminates the systematic bias present in HLL estimates.
🎯 Key Innovation
Instead of counting trailing zeros like HLL, HyperReal uses minimum hash values normalized to the range [0,1]. This approach provides mathematically unbiased estimates while preserving all the benefits of sketch-based cardinality estimation.
From HyperLogLog to HyperReal
🔵 HyperLogLog Approach
- Count trailing zeros in hash
- Store maximum zero count per bucket
- Use harmonic mean for estimation
- Apply bias correction factors
- Requires range corrections
🟢 HyperReal Approach
- Normalize hash to [0,1] range
- Store minimum value per bucket
- Use direct mathematical formula
- Inherently unbiased
- No corrections needed
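The contrast between the two observables can be sketched in a few lines. The `hash_bits` helper and the `"user-42"` input are illustrative only, not part of CardinalityKit:

```python
import hashlib

def hash_bits(item, nbits=64):
    """First nbits of a SHA-256 digest as a binary string."""
    digest = hashlib.sha256(str(item).encode()).hexdigest()
    return bin(int(digest, 16))[2:].zfill(256)[:nbits]

bits = hash_bits("user-42")

# HLL-style observable: the run of trailing zeros in the bit string
trailing_zeros = len(bits) - len(bits.rstrip('0'))

# HyperReal-style observable: the same bits read as a fraction in [0, 1]
fraction = int(bits, 2) / (2 ** len(bits) - 1)
```

Both quantities come from the same uniform hash; HLL keeps the coarser integer statistic, while HyperReal keeps the real-valued one.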
Mathematical Foundation
Why This Works:
- Uniform Distribution: Hash function maps inputs uniformly to [0,1]
- Minimum Statistics: Minimum of uniform random variables follows known distribution
- Beta Distribution: M[j] follows Beta(1, n/m) distribution
- Exponential Approximation: For large n, Beta approximates exponential distribution
- Unbiased Estimation: Mathematical expectation equals true cardinality
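The minimum-statistics step above can be checked empirically: the minimum of n Uniform(0,1) draws follows Beta(1, n), whose mean is 1/(n+1). A small Monte Carlo sketch (the values of `n` and `trials` are arbitrary choices for illustration):

```python
import random

random.seed(0)

n = 50          # uniform draws per trial (stand-in for the n/m items in one bucket)
trials = 20000

# Average minimum of n Uniform(0,1) draws; theory says E[min] = 1/(n+1)
avg_min = sum(min(random.random() for _ in range(n)) for _ in range(trials)) / trials
expected = 1 / (n + 1)   # mean of Beta(1, n)
```

With enough trials, `avg_min` lands within a fraction of a percent of `expected`, which is the property the estimator relies on.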
HyperReal Implementation
```python
import hashlib

def HyperReal(items, k):
    m = 2 ** k               # number of buckets
    buckets = [1.0] * m      # initialize to 1.0 (maximum possible value)
    for item in items:
        # Hash the item and view the digest as a 256-bit binary string
        digest = hashlib.sha256(str(item).encode()).hexdigest()
        bits = bin(int(digest, 16))[2:].zfill(256)
        # Bucket index from the first k bits
        j = int(bits[:k], 2)
        # Normalize the next 32 bits of the same hash to the [0, 1] range
        x = int(bits[k:k + 32], 2) / (2 ** 32 - 1)
        # Keep the minimum value per bucket
        buckets[j] = min(buckets[j], x)
    # Unbiased estimate: m^2 divided by the sum of per-bucket minima
    return (m ** 2) / sum(buckets)
```
Key Differences from HLL:
- Initialize buckets to 1.0 instead of 0
- Store minimum hash values instead of maximum zero counts
- Use sum instead of harmonic mean
- No bias correction needed
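These differences can be seen end to end with a self-contained run. The `hyperreal_estimate` helper below is a compact restatement for demonstration, not the CardinalityKit API, and the synthetic `"user-{i}"` keys are assumptions:

```python
import hashlib

def hyperreal_estimate(items, k):
    """Minimal HyperReal: bucket by the first k hash bits, keep per-bucket minima."""
    m = 2 ** k
    buckets = [1.0] * m
    for item in items:
        bits = bin(int(hashlib.sha256(str(item).encode()).hexdigest(), 16))[2:].zfill(256)
        j = int(bits[:k], 2)
        x = int(bits[k:k + 32], 2) / (2 ** 32 - 1)
        buckets[j] = min(buckets[j], x)
    return (m ** 2) / sum(buckets)

true_n = 10_000
estimate = hyperreal_estimate((f"user-{i}" for i in range(true_n)), k=10)
# With 2^10 = 1024 buckets, the estimate lands in the vicinity of true_n
```

Note there is no bias-correction branch anywhere: the same formula is used for every cardinality range.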
Enhanced HyperReal with Double Hashing
The CardinalityKit implementation uses double hashing for increased entropy:
```python
import hashlib

class HyperRealSketch:
    def __init__(self, b_m):
        self.b_m = b_m
        self.m = 2 ** b_m
        self.registers = [1.0] * self.m

    def _hash_function(self, data):
        # Double hashing for increased entropy
        inner_hash = hashlib.sha256(str(data).encode('utf-8')).hexdigest()
        outer_hash = hashlib.sha256(inner_hash.encode('utf-8')).hexdigest()
        # Convert to a zero-padded 256-bit binary string
        return bin(int(outer_hash, 16))[2:].zfill(256)

    def update_sketch(self, user_id):
        bits = self._hash_function(user_id)
        j = int(bits[:self.b_m], 2)   # bucket index from the first b_m bits
        # Normalize the next 32 bits of the same hash to [0, 1]
        w = int(bits[self.b_m:self.b_m + 32], 2) / (2 ** 32 - 1)
        # Keep the minimum value
        self.registers[j] = min(self.registers[j], w)

    def get_cardinality_estimate(self):
        return (self.m ** 2) / sum(self.registers)
```
Bucket Visualization
HyperLogLog vs HyperReal Buckets:
HyperLogLog buckets store the maximum trailing-zero count seen per bucket, while HyperReal buckets store the minimum normalized hash value, for example:
HyperReal buckets (minimum values): [0.08, 0.11, 0.57, 0.03, 0.63, 0.82, 0.18, 0.12]
Performance Comparison: HLL vs HR
```python
# Simulation results from CardinalityKit
# Using the NYC Parking Violations dataset
import statistics

def compare_hll_hr():
    results = {}
    for k in range(2, 20):              # different bucket sizes (m = 2^k)
        hll_estimates = []
        hr_estimates = []
        for iteration in range(8):      # multiple runs
            # Generate test hashes with different offsets
            hashes = generate_test_hashes(iteration)
            # Run both algorithms
            hll_estimates.append(HyperLogLog(hashes, k))
            hr_estimates.append(HyperReal(hashes, k))
        # Relative error against the known exact count of the dataset
        results[k] = {
            'HLL_mean': statistics.mean(hll_estimates),
            'HR_mean': statistics.mean(hr_estimates),
            'HLL_error': abs(exact_count - statistics.mean(hll_estimates)) / exact_count,
            'HR_error': abs(exact_count - statistics.mean(hr_estimates)) / exact_count,
        }
    return results
```
🎯 HyperReal Advantages:
- Unbiased Estimates: No systematic over/under-estimation
- Simpler Mathematics: No complex bias correction formulas
- Better Accuracy: Especially for smaller cardinalities
- Consistent Performance: More predictable across different data distributions
- Easier Implementation: Fewer edge cases and corrections
Sketch Operations
Like HyperLogLog, HyperReal sketches support union and intersection operations:
```python
def merge_hyperreal_union(sketch1, sketch2):
    """Merge two HyperReal sketches for a UNION operation."""
    merged_registers = []
    for i in range(len(sketch1.registers)):
        # Take the minimum value for union
        merged_registers.append(min(sketch1.registers[i], sketch2.registers[i]))
    return merged_registers

def merge_hyperreal_intersection(sketch1, sketch2):
    """Merge two HyperReal sketches for an INTERSECTION operation."""
    merged_registers = []
    for i in range(len(sketch1.registers)):
        # Take the maximum value for intersection
        merged_registers.append(max(sketch1.registers[i], sketch2.registers[i]))
    return merged_registers
```
Merge Operations Visualization:
Union (MIN operation):
Sketch A: [0.6, 0.1, 0.8, 0.5] + Sketch B: [0.11, 0.5, 0.10, 0.1] = [0.11, 0.1, 0.10, 0.1]
Intersection (MAX operation):
Sketch A: [0.6, 0.1, 0.8, 0.5] + Sketch B: [0.11, 0.5, 0.10, 0.1] = [0.6, 0.5, 0.8, 0.5]
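The merge arithmetic above is just element-wise MIN and MAX over the register arrays, which can be verified directly (`merge_union`/`merge_intersection` are illustrative helper names):

```python
def merge_union(a, b):
    """Element-wise MIN merge (union)."""
    return [min(x, y) for x, y in zip(a, b)]

def merge_intersection(a, b):
    """Element-wise MAX merge (intersection)."""
    return [max(x, y) for x, y in zip(a, b)]

sketch_a = [0.6, 0.1, 0.8, 0.5]
sketch_b = [0.11, 0.5, 0.10, 0.1]

union = merge_union(sketch_a, sketch_b)                # [0.11, 0.1, 0.1, 0.1]
intersection = merge_intersection(sketch_a, sketch_b)  # [0.6, 0.5, 0.8, 0.5]
```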
When to Use HyperReal
🎯 Ideal Use Cases:
- High Accuracy Requirements: When bias matters for business decisions
- Small to Medium Cardinalities: HR shows better accuracy for smaller datasets
- Research Applications: When mathematical properties need to be well-understood
- Cross-Platform Analytics: Consistent behavior across different data sources
- Real-Time Systems: Simpler implementation reduces computational overhead
Performance Characteristics:
- Memory Usage: Same register count as HLL (m = 2^precision buckets); each register holds one small fixed-width value
- Computational Complexity: O(1) per element, same as HLL
- Accuracy: Unbiased estimates, often better than HLL
- Mergeability: Full support for union/intersection operations
- Parallelization: Easily distributable across multiple machines
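The mergeability and parallelization claims rest on MIN being associative and commutative: sketching shards independently and merging yields exactly the registers of a single pass over all the data. A minimal sketch of that property, with an illustrative `build_registers` helper and synthetic `"event-{i}"` keys:

```python
import hashlib

def build_registers(items, k):
    """One HyperReal register array (per-bucket minima) over a stream of items."""
    m = 2 ** k
    registers = [1.0] * m
    for item in items:
        bits = bin(int(hashlib.sha256(str(item).encode()).hexdigest(), 16))[2:].zfill(256)
        j = int(bits[:k], 2)
        x = int(bits[k:k + 32], 2) / (2 ** 32 - 1)
        registers[j] = min(registers[j], x)
    return registers

def merge(a, b):
    return [min(x, y) for x, y in zip(a, b)]

items = [f"event-{i}" for i in range(2_000)]
# Two "machines" each sketch half of the stream, then the registers are merged
merged = merge(build_registers(items[:1_000], k=8),
               build_registers(items[1_000:], k=8))
# The merge is lossless: it equals a single sketch built over the whole stream
assert merged == build_registers(items, k=8)
```

This is why per-machine sketches can be combined at query time with no accuracy penalty.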
Implementation Best Practices
```python
# Best practices for a HyperReal implementation
import hashlib

class OptimizedHyperReal:
    def __init__(self, precision=14):
        self.precision = precision
        self.m = 2 ** precision
        self.registers = [1.0] * self.m

    def add(self, element):
        """Add an element to the sketch."""
        bits = self._consistent_hash(element)
        # First `precision` bits pick the bucket; the next 32 bits give the value
        bucket = int(bits[:self.precision], 2)
        normalized_val = int(bits[self.precision:self.precision + 32], 2) / (2 ** 32 - 1)
        self.registers[bucket] = min(self.registers[bucket], normalized_val)

    def _consistent_hash(self, element):
        """Consistent hashing across runs: SHA-256 digest as a 256-bit binary string."""
        digest = hashlib.sha256(str(element).encode('utf-8')).hexdigest()
        return bin(int(digest, 16))[2:].zfill(256)

    def cardinality(self):
        """Return the cardinality estimate."""
        return (self.m ** 2) / sum(self.registers)

    def merge(self, other_sketch):
        """In-place union merge with another sketch of the same precision."""
        for i in range(self.m):
            self.registers[i] = min(self.registers[i], other_sketch.registers[i])
```