What is HyperReal?
HyperReal (HR) is an innovative cardinality estimation algorithm that provides an unbiased alternative to HyperLogLog. While maintaining the same memory efficiency and privacy benefits, HyperReal eliminates the systematic bias present in HLL estimates.
🎯 Key Innovation
Instead of counting trailing zeros like HLL, HyperReal uses minimum hash values normalized to the range [0,1]. This approach provides mathematically unbiased estimates while preserving all the benefits of sketch-based cardinality estimation.
From HyperLogLog to HyperReal
🔵 HyperLogLog Approach
- Count trailing zeros in hash
- Store maximum zero count per bucket
- Use harmonic mean for estimation
- Apply bias correction factors
- Requires range corrections
🟢 HyperReal Approach
- Normalize hash to [0,1] range
- Store minimum value per bucket
- Use direct mathematical formula
- Inherently unbiased
- No corrections needed
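The contrast between the two observables can be sketched in a few lines. The `hash_bits` helper and the `"user-42"` input are illustrative only, not part of CardinalityKit:

```python
import hashlib

def hash_bits(item, nbits=64):
    """First nbits of a SHA-256 digest as a binary string."""
    digest = hashlib.sha256(str(item).encode()).hexdigest()
    return bin(int(digest, 16))[2:].zfill(256)[:nbits]

bits = hash_bits("user-42")

# HLL-style observable: the run of trailing zeros in the bit string
trailing_zeros = len(bits) - len(bits.rstrip('0'))

# HyperReal-style observable: the same bits read as a fraction in [0, 1]
fraction = int(bits, 2) / (2 ** len(bits) - 1)
```

Both quantities come from the same uniform hash; HLL keeps the coarser integer statistic, while HyperReal keeps the real-valued one.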
Mathematical Foundation
Why This Works:
- Uniform Distribution: Hash function maps inputs uniformly to [0,1]
- Minimum Statistics: Minimum of uniform random variables follows known distribution
- Beta Distribution: M[j] follows Beta(1, n/m) distribution
- Exponential Approximation: For large n, Beta approximates exponential distribution
- Unbiased Estimation: Mathematical expectation equals true cardinality
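The minimum-statistics step above can be checked empirically: the minimum of n Uniform(0,1) draws follows Beta(1, n), whose mean is 1/(n+1). A small Monte Carlo sketch (the values of `n` and `trials` are arbitrary choices for illustration):

```python
import random

random.seed(0)

n = 50          # uniform draws per trial (stand-in for the n/m items in one bucket)
trials = 20000

# Average minimum of n Uniform(0,1) draws; theory says E[min] = 1/(n+1)
avg_min = sum(min(random.random() for _ in range(n)) for _ in range(trials)) / trials
expected = 1 / (n + 1)   # mean of Beta(1, n)
```

With enough trials, `avg_min` lands within a fraction of a percent of `expected`, which is the property the estimator relies on.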
HyperReal Implementation
```python
import hashlib

def HyperReal(items, k):
    m = 2 ** k               # number of buckets
    buckets = [1.0] * m      # initialize to 1.0 (maximum possible value)
    for item in items:
        # Hash the item and view the digest as a 256-bit binary string
        digest = hashlib.sha256(str(item).encode()).hexdigest()
        bits = bin(int(digest, 16))[2:].zfill(256)
        # Bucket index from the first k bits
        j = int(bits[:k], 2)
        # Normalize the next 32 bits of the same hash to the [0, 1] range
        x = int(bits[k:k + 32], 2) / (2 ** 32 - 1)
        # Keep the minimum value per bucket
        buckets[j] = min(buckets[j], x)
    # Unbiased estimate: m^2 divided by the sum of per-bucket minima
    return (m ** 2) / sum(buckets)
```
Key Differences from HLL:
- Initialize buckets to 1.0 instead of 0
- Store minimum hash values instead of maximum zero counts
- Use sum instead of harmonic mean
- No bias correction needed
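These differences can be seen end to end with a self-contained run. The `hyperreal_estimate` helper below is a compact restatement for demonstration, not the CardinalityKit API, and the synthetic `"user-{i}"` keys are assumptions:

```python
import hashlib

def hyperreal_estimate(items, k):
    """Minimal HyperReal: bucket by the first k hash bits, keep per-bucket minima."""
    m = 2 ** k
    buckets = [1.0] * m
    for item in items:
        bits = bin(int(hashlib.sha256(str(item).encode()).hexdigest(), 16))[2:].zfill(256)
        j = int(bits[:k], 2)
        x = int(bits[k:k + 32], 2) / (2 ** 32 - 1)
        buckets[j] = min(buckets[j], x)
    return (m ** 2) / sum(buckets)

true_n = 10_000
estimate = hyperreal_estimate((f"user-{i}" for i in range(true_n)), k=10)
# With 2^10 = 1024 buckets, the estimate lands in the vicinity of true_n
```

Note there is no bias-correction branch anywhere: the same formula is used for every cardinality range.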
Enhanced HyperReal with Double Hashing
The CardinalityKit implementation uses double hashing for increased entropy:
```python
import hashlib

class HyperRealSketch:
    def __init__(self, b_m):
        self.b_m = b_m
        self.m = 2 ** b_m
        self.registers = [1.0] * self.m

    def _hash_function(self, data):
        # Double hashing for increased entropy
        inner_hash = hashlib.sha256(str(data).encode('utf-8')).hexdigest()
        outer_hash = hashlib.sha256(inner_hash.encode('utf-8')).hexdigest()
        # Convert to a zero-padded 256-bit binary string
        return bin(int(outer_hash, 16))[2:].zfill(256)

    def update_sketch(self, user_id):
        bits = self._hash_function(user_id)
        j = int(bits[:self.b_m], 2)   # bucket index from the first b_m bits
        # Normalize the next 32 bits of the same hash to [0, 1]
        w = int(bits[self.b_m:self.b_m + 32], 2) / (2 ** 32 - 1)
        # Keep the minimum value
        self.registers[j] = min(self.registers[j], w)

    def get_cardinality_estimate(self):
        return (self.m ** 2) / sum(self.registers)
```
Bucket Visualization
HyperLogLog vs HyperReal Buckets:
HyperLogLog buckets store the maximum trailing-zero count seen per bucket, while HyperReal buckets store the minimum normalized hash value, for example:
HyperReal buckets (minimum values): [0.08, 0.11, 0.57, 0.03, 0.63, 0.82, 0.18, 0.12]
Performance Comparison: HLL vs HR
```python
# Simulation results from CardinalityKit
# Using the NYC Parking Violations dataset
import statistics

def compare_hll_hr():
    results = {}
    for k in range(2, 20):              # different bucket sizes (m = 2^k)
        hll_estimates = []
        hr_estimates = []
        for iteration in range(8):      # multiple runs
            # Generate test hashes with different offsets
            hashes = generate_test_hashes(iteration)
            # Run both algorithms
            hll_estimates.append(HyperLogLog(hashes, k))
            hr_estimates.append(HyperReal(hashes, k))
        # Relative error against the known exact count of the dataset
        results[k] = {
            'HLL_mean': statistics.mean(hll_estimates),
            'HR_mean': statistics.mean(hr_estimates),
            'HLL_error': abs(exact_count - statistics.mean(hll_estimates)) / exact_count,
            'HR_error': abs(exact_count - statistics.mean(hr_estimates)) / exact_count,
        }
    return results
```
🎯 HyperReal Advantages:
- Unbiased Estimates: No systematic over/under-estimation
- Simpler Mathematics: No complex bias correction formulas
- Better Accuracy: Especially for smaller cardinalities
- Consistent Performance: More predictable across different data distributions
- Easier Implementation: Fewer edge cases and corrections
Sketch Operations
Like HyperLogLog, HyperReal sketches support union and intersection operations:
```python
def merge_hyperreal_union(sketch1, sketch2):
    """Merge two HyperReal sketches for a UNION operation."""
    merged_registers = []
    for i in range(len(sketch1.registers)):
        # Take the minimum value for union
        merged_registers.append(min(sketch1.registers[i], sketch2.registers[i]))
    return merged_registers

def merge_hyperreal_intersection(sketch1, sketch2):
    """Merge two HyperReal sketches for an INTERSECTION operation."""
    merged_registers = []
    for i in range(len(sketch1.registers)):
        # Take the maximum value for intersection
        merged_registers.append(max(sketch1.registers[i], sketch2.registers[i]))
    return merged_registers
```
Merge Operations Visualization:
Union (MIN operation):
Sketch A: [0.6, 0.1, 0.8, 0.5] + Sketch B: [0.11, 0.5, 0.10, 0.1] = [0.11, 0.1, 0.10, 0.1]
Intersection (MAX operation):
Sketch A: [0.6, 0.1, 0.8, 0.5] + Sketch B: [0.11, 0.5, 0.10, 0.1] = [0.6, 0.5, 0.8, 0.5]
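The merge arithmetic above is just element-wise MIN and MAX over the register arrays, which can be verified directly (`merge_union`/`merge_intersection` are illustrative helper names):

```python
def merge_union(a, b):
    """Element-wise MIN merge (union)."""
    return [min(x, y) for x, y in zip(a, b)]

def merge_intersection(a, b):
    """Element-wise MAX merge (intersection)."""
    return [max(x, y) for x, y in zip(a, b)]

sketch_a = [0.6, 0.1, 0.8, 0.5]
sketch_b = [0.11, 0.5, 0.10, 0.1]

union = merge_union(sketch_a, sketch_b)                # [0.11, 0.1, 0.1, 0.1]
intersection = merge_intersection(sketch_a, sketch_b)  # [0.6, 0.5, 0.8, 0.5]
```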
When to Use HyperReal
🎯 Ideal Use Cases:
- High Accuracy Requirements: When bias matters for business decisions
- Small to Medium Cardinalities: HR shows better accuracy for smaller datasets
- Research Applications: When mathematical properties need to be well-understood
- Cross-Platform Analytics: Consistent behavior across different data sources
- Real-Time Systems: Simpler implementation reduces computational overhead
Performance Characteristics:
- Memory Usage: Same register count as HLL (m = 2^precision buckets); each register holds one small fixed-width value
- Computational Complexity: O(1) per element, same as HLL
- Accuracy: Unbiased estimates, often better than HLL
- Mergeability: Full support for union/intersection operations
- Parallelization: Easily distributable across multiple machines
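The mergeability and parallelization claims rest on MIN being associative and commutative: sketching shards independently and merging yields exactly the registers of a single pass over all the data. A minimal sketch of that property, with an illustrative `build_registers` helper and synthetic `"event-{i}"` keys:

```python
import hashlib

def build_registers(items, k):
    """One HyperReal register array (per-bucket minima) over a stream of items."""
    m = 2 ** k
    registers = [1.0] * m
    for item in items:
        bits = bin(int(hashlib.sha256(str(item).encode()).hexdigest(), 16))[2:].zfill(256)
        j = int(bits[:k], 2)
        x = int(bits[k:k + 32], 2) / (2 ** 32 - 1)
        registers[j] = min(registers[j], x)
    return registers

def merge(a, b):
    return [min(x, y) for x, y in zip(a, b)]

items = [f"event-{i}" for i in range(2_000)]
# Two "machines" each sketch half of the stream, then the registers are merged
merged = merge(build_registers(items[:1_000], k=8),
               build_registers(items[1_000:], k=8))
# The merge is lossless: it equals a single sketch built over the whole stream
assert merged == build_registers(items, k=8)
```

This is why per-machine sketches can be combined at query time with no accuracy penalty.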
Implementation Best Practices
```python
# Best practices for a HyperReal implementation
import hashlib

class OptimizedHyperReal:
    def __init__(self, precision=14):
        self.precision = precision
        self.m = 2 ** precision
        self.registers = [1.0] * self.m

    def add(self, element):
        """Add an element to the sketch."""
        bits = self._consistent_hash(element)
        # First `precision` bits pick the bucket; the next 32 bits give the value
        bucket = int(bits[:self.precision], 2)
        normalized_val = int(bits[self.precision:self.precision + 32], 2) / (2 ** 32 - 1)
        self.registers[bucket] = min(self.registers[bucket], normalized_val)

    def _consistent_hash(self, element):
        """Consistent hashing across runs: SHA-256 digest as a 256-bit binary string."""
        digest = hashlib.sha256(str(element).encode('utf-8')).hexdigest()
        return bin(int(digest, 16))[2:].zfill(256)

    def cardinality(self):
        """Return the cardinality estimate."""
        return (self.m ** 2) / sum(self.registers)

    def merge(self, other_sketch):
        """In-place union merge with another sketch of the same precision."""
        for i in range(self.m):
            self.registers[i] = min(self.registers[i], other_sketch.registers[i])
```