Why Sketch Operations Matter
One of the most powerful features of cardinality estimation sketches is their ability to be merged. This enables cross-platform analytics, audience deduplication, and complex set operations without exposing individual user data.
🎯 Key Applications
- Cross-Platform Reach: Total unique users across Google + Facebook + TV
- Audience Overlap: Users who appear on both platforms
- Incremental Reach: New users gained by adding a platform
- Geographic Analysis: Audience distribution across regions
Set Theory Foundation
Example: Platform Analytics
- Platform A: 1M unique users
- Platform B: 800K unique users
- Union (Total Reach): 1.5M unique users
- Intersection (Overlap): 300K unique users
- Platform A Only: 700K unique users
- Platform B Only: 500K unique users
HyperLogLog Sketch Operations
🟢 Union Operation (MAX)
To find total unique elements across sketches, take the maximum rank value for each bucket.
Logic:
Higher rank indicates rarer pattern → more unique elements in that bucket
🔴 Intersection Operation (MIN)
To find overlapping elements, take the minimum rank value for each bucket.
Logic:
Lower rank indicates common pattern → elements present in both sketches
def merge_hll_sketches(sketch1, sketch2, operation='union'):
"""Merge two HyperLogLog sketches"""
merged_buckets = []
for i in range(len(sketch1.buckets)):
if operation == 'union':
# Union: take maximum rank (more unique elements)
merged_buckets.append(max(sketch1.buckets[i], sketch2.buckets[i]))
elif operation == 'intersection':
# Intersection: take minimum rank (common elements)
merged_buckets.append(min(sketch1.buckets[i], sketch2.buckets[i]))
# Calculate cardinality from merged buckets
total = sum(2 ** (-1 * bucket) for bucket in merged_buckets)
harmonic_mean = total ** -1
m = len(merged_buckets)
estimate = (m ** 2) * harmonic_mean
# Apply bias correction
if len(sketch1.buckets) <= 16:
BIAS = 0.673
elif len(sketch1.buckets) == 32:
BIAS = 0.697
else:
BIAS = 0.7213 / (1 + (1.079 / m))
return BIAS * estimate
HyperLogLog Bucket Merging Example:
Sketch A
Sketch B
MAX (Union)
Union Result
HyperReal Sketch Operations
🟢 Union Operation (MIN)
To find total unique elements, take the minimum hash value for each bucket.
Logic:
Smaller minimum indicates more elements hashed to that bucket
🔴 Intersection Operation (MAX)
To find overlapping elements, take the maximum hash value for each bucket.
Logic:
Larger minimum indicates fewer common elements in that bucket
def merge_hyperreal_sketches(sketch1, sketch2, operation='union'):
"""Merge two HyperReal sketches"""
merged_registers = []
for i in range(len(sketch1.registers)):
if operation == 'union':
# Union: take minimum value (more elements)
merged_registers.append(min(sketch1.registers[i], sketch2.registers[i]))
elif operation == 'intersection':
# Intersection: take maximum value (fewer common elements)
merged_registers.append(max(sketch1.registers[i], sketch2.registers[i]))
# Calculate cardinality (no bias correction needed for HyperReal)
m = len(merged_registers)
estimate = (m ** 2) / sum(merged_registers)
return estimate
HyperReal Bucket Merging Example:
Sketch A
Sketch B
MIN (Union)
Union Result
Extended Sketch Operations
Extended sketches (with demographic information) require special handling during merging:
def merge_extended_sketches(sketch1, sketch2, operation='union'):
"""Merge extended sketches with demographic preservation"""
merged_sketch = ExtendedHyperRealSketch(sketch1.b_m, sketch1.b_s)
for i in range(sketch1.m):
if operation == 'union':
# Take sketch with smaller register value (more elements)
if sketch1.registers[i] <= sketch2.registers[i]:
merged_sketch.registers[i] = sketch1.registers[i]
merged_sketch.frequency_counts[i] = sketch1.frequency_counts[i]
merged_sketch.attribute_samples[i] = sketch1.attribute_samples[i]
else:
merged_sketch.registers[i] = sketch2.registers[i]
merged_sketch.frequency_counts[i] = sketch2.frequency_counts[i]
merged_sketch.attribute_samples[i] = sketch2.attribute_samples[i]
# Handle equal values (rare with floats)
if sketch1.registers[i] == sketch2.registers[i]:
merged_sketch.frequency_counts[i] = (
sketch1.frequency_counts[i] + sketch2.frequency_counts[i]
)
# Demographic conflict resolution
if sketch1.attribute_samples[i] != sketch2.attribute_samples[i]:
# Weighted selection based on frequency
if sketch1.frequency_counts[i] >= sketch2.frequency_counts[i]:
merged_sketch.attribute_samples[i] = sketch1.attribute_samples[i]
else:
merged_sketch.attribute_samples[i] = sketch2.attribute_samples[i]
elif operation == 'intersection':
# Take sketch with larger register value (fewer common elements)
if sketch1.registers[i] >= sketch2.registers[i]:
merged_sketch.registers[i] = sketch1.registers[i]
merged_sketch.frequency_counts[i] = min(
sketch1.frequency_counts[i], sketch2.frequency_counts[i]
)
merged_sketch.attribute_samples[i] = sketch1.attribute_samples[i]
else:
merged_sketch.registers[i] = sketch2.registers[i]
merged_sketch.frequency_counts[i] = min(
sketch1.frequency_counts[i], sketch2.frequency_counts[i]
)
merged_sketch.attribute_samples[i] = sketch2.attribute_samples[i]
return merged_sketch
Real-World Use Cases
🌐 Cross-Platform Reach
Measure total unique audience across Google, Facebook, and TV without double-counting
📊 Audience Overlap
Find users who engage with multiple platforms for targeted campaigns
📈 Incremental Analysis
Measure new reach gained by adding additional advertising channels
🎯 Campaign Attribution
Understand which combinations of channels drive unique conversions
🌍 Geographic Insights
Analyze audience distribution and overlap across different regions
📱 Device Analysis
Cross-device user tracking while preserving privacy
Complete Implementation Example
class SketchAnalytics:
"""Complete sketch operations for cross-platform analytics"""
def __init__(self):
self.sketches = {}
def add_platform_sketch(self, platform_name, sketch):
"""Add a platform sketch to the analytics"""
self.sketches[platform_name] = sketch
def get_total_reach(self, platforms=None):
"""Calculate total unique reach across specified platforms"""
if platforms is None:
platforms = list(self.sketches.keys())
if len(platforms) == 1:
return self.sketches[platforms[0]].get_cardinality_estimate()
# Start with first platform
merged_sketch = self.sketches[platforms[0]]
# Union with remaining platforms
for platform in platforms[1:]:
merged_sketch = merge_hyperreal_sketches(
merged_sketch,
self.sketches[platform],
operation='union'
)
return merged_sketch
def get_overlap(self, platform1, platform2):
"""Calculate audience overlap between two platforms"""
return merge_hyperreal_sketches(
self.sketches[platform1],
self.sketches[platform2],
operation='intersection'
)
def get_incremental_reach(self, base_platforms, new_platform):
"""Calculate incremental reach of adding a new platform"""
base_reach = self.get_total_reach(base_platforms)
total_reach = self.get_total_reach(base_platforms + [new_platform])
return total_reach - base_reach
def analyze_platform_mix(self):
"""Complete analysis of platform combinations"""
platforms = list(self.sketches.keys())
results = {}
# Individual platform reach
for platform in platforms:
results[platform] = self.sketches[platform].get_cardinality_estimate()
# Pairwise overlaps
for i, p1 in enumerate(platforms):
for p2 in platforms[i+1:]:
overlap_key = f"{p1}_∩_{p2}"
results[overlap_key] = self.get_overlap(p1, p2)
# Total reach
results['total_reach'] = self.get_total_reach()
# Platform-only audiences (exclusive reach)
for platform in platforms:
others = [p for p in platforms if p != platform]
if others:
others_reach = self.get_total_reach(others)
platform_only = results[platform] - self.get_overlap(platform, others[0])
results[f"{platform}_only"] = max(0, platform_only)
return results
# Usage example
analytics = SketchAnalytics()
analytics.add_platform_sketch('google', google_sketch)
analytics.add_platform_sketch('facebook', facebook_sketch)
analytics.add_platform_sketch('tv', tv_sketch)
# Get comprehensive analysis
results = analytics.analyze_platform_mix()
print(f"Total Reach: {results['total_reach']:,}")
print(f"Google Only: {results['google_only']:,}")
print(f"Facebook ∩ TV: {results['facebook_∩_tv']:,}")
Performance and Accuracy
🎯 Key Properties
- Associative: (A ∪ B) ∪ C = A ∪ (B ∪ C)
- Commutative: A ∪ B = B ∪ A
- Idempotent: A ∪ A = A
- Fast: O(m) time complexity where m = number of buckets
- Memory Efficient: No additional memory needed
Accuracy Considerations:
- Union Accuracy: Generally very good, slight overestimation
- Intersection Accuracy: Less accurate, especially for small overlaps
- Error Propagation: Errors compound with multiple operations
- Sketch Size Impact: Larger sketches (higher k) improve accuracy