HyperReal Algorithm

The Unbiased Alternative to HyperLogLog

CardinalityKit Documentation

What is HyperReal?

HyperReal (HR) is an innovative cardinality estimation algorithm that provides an unbiased alternative to HyperLogLog. While maintaining the same memory efficiency and privacy benefits, HyperReal eliminates the systematic bias present in HLL estimates.

🎯 Key Innovation

Instead of counting trailing zeros like HLL, HyperReal uses minimum hash values normalized to the range [0,1]. This approach provides mathematically unbiased estimates while preserving all the benefits of sketch-based cardinality estimation.
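The intuition behind minimum-value estimation can be sketched with a single bucket. This is a toy illustration (not CardinalityKit code): the minimum of n uniform draws on [0, 1] has expectation 1/(n + 1), so the observed minimum encodes n.

```python
import random

# Toy single-bucket illustration (not the CardinalityKit API):
# the minimum of n uniform draws on [0, 1] has expectation 1 / (n + 1),
# so 1 / minimum - 1 gives a rough estimate of n.
random.seed(7)
n = 10_000
minimum = min(random.random() for _ in range(n))
estimate = 1 / minimum - 1
print(f"true n = {n}, single-bucket estimate = {estimate:.0f}")
```

A single bucket has high variance; HyperReal averages this signal over m buckets to make the estimate stable.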

From HyperLogLog to HyperReal

🔵 HyperLogLog Approach

  • Count trailing zeros in hash
  • Store maximum zero count per bucket
  • Use harmonic mean for estimation
  • Apply bias correction factors
  • Requires range corrections

🟢 HyperReal Approach

  • Normalize hash to [0,1] range
  • Store minimum value per bucket
  • Use direct mathematical formula
  • Inherently unbiased
  • No corrections needed

Mathematical Foundation

HyperReal Core Principle:

For uniform hash function h: U → [0,1]

M[j] ~ Beta(1, n/m), where n = cardinality and m = number of buckets
(M[j] is the minimum of the ≈ n/m uniform hash values that land in bucket j)

For large n/m: M[j] ≈ (m/n) × Exp(1)

Estimate = m² / Σ M[j]

Why This Works:

  1. Uniform Distribution: Hash function maps inputs uniformly to [0,1]
  2. Minimum Statistics: Minimum of uniform random variables follows known distribution
  3. Beta Distribution: M[j] follows Beta(1, n/m) distribution
  4. Exponential Approximation: For large n, Beta approximates exponential distribution
  5. Unbiased Estimation: Mathematical expectation equals true cardinality
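The five steps above can be checked numerically. The following quick simulation (an illustrative check, not part of CardinalityKit) draws n/m uniform values per bucket, keeps each bucket's minimum, and applies the estimate formula:

```python
import random

# Numerical check of the core principle: m buckets, each receiving
# n/m uniform hash values; keep the minimum per bucket and confirm
# that m**2 / sum(minima) recovers n.
random.seed(1)
m, n = 256, 102_400          # 400 values per bucket
per_bucket = n // m
minima = [min(random.random() for _ in range(per_bucket)) for _ in range(m)]
estimate = m ** 2 / sum(minima)
relative_error = abs(estimate - n) / n
print(f"estimate = {estimate:.0f}, relative error = {relative_error:.1%}")
```

With m = 256 buckets the relative error of a single run is typically a few percent, consistent with the ~1/√m standard error of sketch estimators.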

HyperReal Implementation

import hashlib

def HyperReal(hashes, k):
    m = 2 ** k                  # Number of buckets
    buckets = [1.0] * m         # Initialize to 1.0 (maximum possible value)
    for hash_val in hashes:
        # Hash the input and interpret the digest as a 256-bit string
        digest = hashlib.sha256(str(hash_val).encode()).hexdigest()
        bits = bin(int(digest, 16))[2:].zfill(256)
        # Bucket index from the first k bits
        j = int(bits[:k], 2)
        # Normalize the next 32 bits to the [0, 1] range
        x = int(bits[k:k + 32], 2) / (2 ** 32 - 1)
        # Keep the minimum value per bucket
        buckets[j] = min(buckets[j], x)
    # Calculate the unbiased estimate
    estimate = (m ** 2) / sum(buckets)
    return estimate

Key Differences from HLL:

  • Initialize buckets to 1.0 instead of 0
  • Store minimum hash values instead of maximum zero counts
  • Use sum instead of harmonic mean
  • No bias correction needed

Enhanced HyperReal with Double Hashing

The CardinalityKit implementation uses double hashing for increased entropy:

import hashlib

class HyperRealSketch:
    def __init__(self, b_m):
        self.b_m = b_m
        self.m = 2 ** b_m
        self.registers = [1.0] * self.m

    def _hash_function(self, data):
        # Double hashing for increased entropy
        inner_hash = hashlib.sha256(str(data).encode('utf-8')).hexdigest()
        outer_hash = hashlib.sha256(inner_hash.encode('utf-8')).hexdigest()
        # Convert to a zero-padded 256-bit binary string
        return '{:0256b}'.format(int(outer_hash, 16))

    def update_sketch(self, user_id):
        bits = self._hash_function(user_id)
        # Bucket index from the first b_m bits
        j = int(bits[:self.b_m], 2)
        # Normalize the next 32 bits of the same hash to [0, 1]
        w = int(bits[self.b_m:self.b_m + 32], 2) / (2 ** 32 - 1)
        # Keep the minimum value
        self.registers[j] = min(self.registers[j], w)

    def get_cardinality_estimate(self):
        return (self.m ** 2) / sum(self.registers)

Bucket Visualization

HyperLogLog vs HyperReal Buckets:

HyperLogLog Buckets (Zero Counts):

[6, 1, 0, 11, 8, 5, 7, 9]

HyperReal Buckets (Minimum Values):

[0.08, 0.11, 0.57, 0.03, 0.63, 0.82, 0.18, 0.12]

Performance Comparison: HLL vs HR

# Simulation results from CardinalityKit
# Using the NYC Parking Violations dataset
# (generate_test_hashes, HyperLogLog, and exact_count are defined elsewhere)
import statistics

def compare_hll_hr():
    results = {}
    for k in range(2, 20):              # Different bucket sizes
        hll_estimates = []
        hr_estimates = []
        for iteration in range(8):      # Multiple runs
            # Generate test hashes with different offsets
            hashes = generate_test_hashes(iteration)
            # Run both algorithms
            hll_estimates.append(HyperLogLog(hashes, k))
            hr_estimates.append(HyperReal(hashes, k))
        # Calculate statistics
        hll_mean = statistics.mean(hll_estimates)
        hr_mean = statistics.mean(hr_estimates)
        results[k] = {
            'HLL_mean': hll_mean,
            'HR_mean': hr_mean,
            'HLL_error': abs(exact_count - hll_mean) / exact_count,
            'HR_error': abs(exact_count - hr_mean) / exact_count,
        }
    return results

🎯 HyperReal Advantages:

  • Unbiased Estimates: No systematic over/under-estimation
  • Simpler Mathematics: No complex bias correction formulas
  • Better Accuracy: Especially for smaller cardinalities
  • Consistent Performance: More predictable across different data distributions
  • Easier Implementation: Fewer edge cases and corrections

Sketch Operations

Like HyperLogLog, HyperReal sketches support union and intersection operations:

def merge_hyperreal_union(sketch1, sketch2):
    """Merge two HyperReal sketches for a UNION operation."""
    merged_registers = []
    for i in range(len(sketch1.registers)):
        # Take the minimum value for union
        merged_registers.append(min(sketch1.registers[i], sketch2.registers[i]))
    return merged_registers

def merge_hyperreal_intersection(sketch1, sketch2):
    """Merge two HyperReal sketches for an INTERSECTION operation."""
    merged_registers = []
    for i in range(len(sketch1.registers)):
        # Take the maximum value for intersection
        merged_registers.append(max(sketch1.registers[i], sketch2.registers[i]))
    return merged_registers

Merge Operations Visualization:

Union (MIN operation):

Sketch A: [0.6, 0.1, 0.8, 0.5] + Sketch B: [0.11, 0.5, 0.10, 0.1] = [0.11, 0.1, 0.10, 0.1]

Intersection (MAX operation):

Sketch A: [0.6, 0.1, 0.8, 0.5] + Sketch B: [0.11, 0.5, 0.10, 0.1] = [0.6, 0.5, 0.8, 0.5]
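The arrays in the visualization above can be verified directly with plain Python lists, without any sketch class:

```python
# Element-wise merge of the example register arrays shown above
sketch_a = [0.6, 0.1, 0.8, 0.5]
sketch_b = [0.11, 0.5, 0.10, 0.1]

# Union keeps the minimum per register; intersection keeps the maximum
union = [min(a, b) for a, b in zip(sketch_a, sketch_b)]
intersection = [max(a, b) for a, b in zip(sketch_a, sketch_b)]

print(union)         # [0.11, 0.1, 0.1, 0.1]
print(intersection)  # [0.6, 0.5, 0.8, 0.5]
```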

When to Use HyperReal

🎯 Ideal Use Cases:

  • High Accuracy Requirements: When bias matters for business decisions
  • Small to Medium Cardinalities: HR shows better accuracy for smaller datasets
  • Research Applications: When mathematical properties need to be well-understood
  • Cross-Platform Analytics: Consistent behavior across different data sources
  • Real-Time Systems: Simpler implementation reduces computational overhead

Performance Characteristics:

  • Memory Usage: m registers, as in HLL; note that an HLL register needs only O(log log N) bits, while a HyperReal register stores a floating-point minimum (e.g., 32 bits)
  • Computational Complexity: O(1) per element, same as HLL
  • Accuracy: Unbiased estimates, often better than HLL
  • Mergeability: Full support for union/intersection operations
  • Parallelization: Easily distributable across multiple machines
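The mergeability and parallelization points can be illustrated by sketching two data shards independently and combining their register arrays with an element-wise minimum. The helper names below are illustrative, not part of the CardinalityKit API:

```python
import hashlib

# Hypothetical illustration of distributed sketching: one register array
# per shard, merged with an element-wise minimum (a union merge).
K = 10
M = 2 ** K

def sketch_shard(items, m=M, k=K):
    registers = [1.0] * m
    for item in items:
        digest = hashlib.sha256(str(item).encode()).hexdigest()
        bits = '{:0256b}'.format(int(digest, 16))
        j = int(bits[:k], 2)                          # Bucket index
        w = int(bits[k:k + 32], 2) / (2 ** 32 - 1)    # Normalized value
        registers[j] = min(registers[j], w)
    return registers

def merge_all(register_arrays):
    # Element-wise minimum across all shards
    return [min(vals) for vals in zip(*register_arrays)]

# Two shards with overlapping items: the merged estimate reflects the union
shard1 = sketch_shard(range(0, 60_000))
shard2 = sketch_shard(range(40_000, 100_000))
merged = merge_all([shard1, shard2])
estimate = M ** 2 / sum(merged)
print(f"union estimate = {estimate:.0f} (true 100000)")
```

Because the merge is an element-wise minimum, it is associative and commutative, so shards can be sketched and combined in any order across machines.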

Implementation Best Practices

# Best practices for HyperReal implementation
import hashlib

class OptimizedHyperReal:
    def __init__(self, precision=14):
        self.precision = precision
        self.m = 2 ** precision
        self.registers = [1.0] * self.m

    def add(self, element):
        """Add an element to the sketch."""
        # Use consistent hashing
        hash_val = self._consistent_hash(element)
        bucket = self._get_bucket(hash_val)
        normalized_val = self._normalize_hash(hash_val)
        self.registers[bucket] = min(self.registers[bucket], normalized_val)

    def _consistent_hash(self, element):
        """Ensure consistent hashing across different runs."""
        return hashlib.sha256(str(element).encode('utf-8')).hexdigest()

    def _get_bucket(self, hash_hex):
        """Bucket index from the first `precision` bits of the hash."""
        bits = '{:0256b}'.format(int(hash_hex, 16))
        return int(bits[:self.precision], 2)

    def _normalize_hash(self, hash_hex):
        """Normalize the low 32 bits of the hash to the [0, 1] range."""
        return int(hash_hex[-8:], 16) / (16 ** 8 - 1)

    def cardinality(self):
        """Get the cardinality estimate."""
        return (self.m ** 2) / sum(self.registers)

    def merge(self, other_sketch):
        """Merge with another HyperReal sketch (union)."""
        for i in range(self.m):
            self.registers[i] = min(self.registers[i], other_sketch.registers[i])