Sketch Operations

Merging Sketches for Union and Intersection Analytics

CardinalityKit Documentation

Why Sketch Operations Matter

One of the most powerful features of cardinality estimation sketches is their ability to be merged. This enables cross-platform analytics, audience deduplication, and complex set operations without exposing individual user data.

🎯 Key Applications

  • Cross-Platform Reach: Total unique users across Google + Facebook + TV
  • Audience Overlap: Users who appear on both platforms
  • Incremental Reach: New users gained by adding a platform
  • Geographic Analysis: Audience distribution across regions

Set Theory Foundation

Set Operations:

Union: A ∪ B = {x | x ∈ A or x ∈ B}

Intersection: A ∩ B = {x | x ∈ A and x ∈ B}

Difference: A - B = {x | x ∈ A and x ∉ B}

Example: Platform Analytics

  • Platform A: 1M unique users
  • Platform B: 800K unique users
  • Union (Total Reach): 1.5M unique users
  • Intersection (Overlap): 300K unique users
  • Platform A Only: 700K unique users
  • Platform B Only: 500K unique users

HyperLogLog Sketch Operations

🟢 Union Operation (MAX)

To find total unique elements across sketches, take the maximum rank value for each bucket.

Logic:

Higher rank indicates rarer pattern → more unique elements in that bucket

🔴 Intersection Operation (MIN)

To find overlapping elements, take the minimum rank value for each bucket.

Logic:

Lower rank indicates common pattern → elements present in both sketches

def merge_hll_sketches(sketch1, sketch2, operation='union'): """Merge two HyperLogLog sketches""" merged_buckets = [] for i in range(len(sketch1.buckets)): if operation == 'union': # Union: take maximum rank (more unique elements) merged_buckets.append(max(sketch1.buckets[i], sketch2.buckets[i])) elif operation == 'intersection': # Intersection: take minimum rank (common elements) merged_buckets.append(min(sketch1.buckets[i], sketch2.buckets[i])) # Calculate cardinality from merged buckets total = sum(2 ** (-1 * bucket) for bucket in merged_buckets) harmonic_mean = total ** -1 m = len(merged_buckets) estimate = (m ** 2) * harmonic_mean # Apply bias correction if len(sketch1.buckets) <= 16: BIAS = 0.673 elif len(sketch1.buckets) == 32: BIAS = 0.697 else: BIAS = 0.7213 / (1 + (1.079 / m)) return BIAS * estimate

HyperLogLog Bucket Merging Example:

6
1
0
11
8
Sketch A
11
1
5
3
10
Sketch B
MAX (Union)
11
1
5
11
10
Union Result

HyperReal Sketch Operations

🟢 Union Operation (MIN)

To find total unique elements, take the minimum hash value for each bucket.

Logic:

Smaller minimum indicates more elements hashed to that bucket

🔴 Intersection Operation (MAX)

To find overlapping elements, take the maximum hash value for each bucket.

Logic:

Larger minimum indicates fewer common elements in that bucket

def merge_hyperreal_sketches(sketch1, sketch2, operation='union'): """Merge two HyperReal sketches""" merged_registers = [] for i in range(len(sketch1.registers)): if operation == 'union': # Union: take minimum value (more elements) merged_registers.append(min(sketch1.registers[i], sketch2.registers[i])) elif operation == 'intersection': # Intersection: take maximum value (fewer common elements) merged_registers.append(max(sketch1.registers[i], sketch2.registers[i])) # Calculate cardinality (no bias correction needed for HyperReal) m = len(merged_registers) estimate = (m ** 2) / sum(merged_registers) return estimate

HyperReal Bucket Merging Example:

0.6
0.1
0.8
0.5
0.7
Sketch A
0.11
0.5
0.10
0.3
0.1
Sketch B
MIN (Union)
0.11
0.1
0.10
0.3
0.1
Union Result

Extended Sketch Operations

Extended sketches (with demographic information) require special handling during merging:

def merge_extended_sketches(sketch1, sketch2, operation='union'): """Merge extended sketches with demographic preservation""" merged_sketch = ExtendedHyperRealSketch(sketch1.b_m, sketch1.b_s) for i in range(sketch1.m): if operation == 'union': # Take sketch with smaller register value (more elements) if sketch1.registers[i] <= sketch2.registers[i]: merged_sketch.registers[i] = sketch1.registers[i] merged_sketch.frequency_counts[i] = sketch1.frequency_counts[i] merged_sketch.attribute_samples[i] = sketch1.attribute_samples[i] else: merged_sketch.registers[i] = sketch2.registers[i] merged_sketch.frequency_counts[i] = sketch2.frequency_counts[i] merged_sketch.attribute_samples[i] = sketch2.attribute_samples[i] # Handle equal values (rare with floats) if sketch1.registers[i] == sketch2.registers[i]: merged_sketch.frequency_counts[i] = ( sketch1.frequency_counts[i] + sketch2.frequency_counts[i] ) # Demographic conflict resolution if sketch1.attribute_samples[i] != sketch2.attribute_samples[i]: # Weighted selection based on frequency if sketch1.frequency_counts[i] >= sketch2.frequency_counts[i]: merged_sketch.attribute_samples[i] = sketch1.attribute_samples[i] else: merged_sketch.attribute_samples[i] = sketch2.attribute_samples[i] elif operation == 'intersection': # Take sketch with larger register value (fewer common elements) if sketch1.registers[i] >= sketch2.registers[i]: merged_sketch.registers[i] = sketch1.registers[i] merged_sketch.frequency_counts[i] = min( sketch1.frequency_counts[i], sketch2.frequency_counts[i] ) merged_sketch.attribute_samples[i] = sketch1.attribute_samples[i] else: merged_sketch.registers[i] = sketch2.registers[i] merged_sketch.frequency_counts[i] = min( sketch1.frequency_counts[i], sketch2.frequency_counts[i] ) merged_sketch.attribute_samples[i] = sketch2.attribute_samples[i] return merged_sketch

Real-World Use Cases

🌐 Cross-Platform Reach

Measure total unique audience across Google, Facebook, and TV without double-counting

📊 Audience Overlap

Find users who engage with multiple platforms for targeted campaigns

📈 Incremental Analysis

Measure new reach gained by adding additional advertising channels

🎯 Campaign Attribution

Understand which combinations of channels drive unique conversions

🌍 Geographic Insights

Analyze audience distribution and overlap across different regions

📱 Device Analysis

Cross-device user tracking while preserving privacy

Complete Implementation Example

class SketchAnalytics: """Complete sketch operations for cross-platform analytics""" def __init__(self): self.sketches = {} def add_platform_sketch(self, platform_name, sketch): """Add a platform sketch to the analytics""" self.sketches[platform_name] = sketch def get_total_reach(self, platforms=None): """Calculate total unique reach across specified platforms""" if platforms is None: platforms = list(self.sketches.keys()) if len(platforms) == 1: return self.sketches[platforms[0]].get_cardinality_estimate() # Start with first platform merged_sketch = self.sketches[platforms[0]] # Union with remaining platforms for platform in platforms[1:]: merged_sketch = merge_hyperreal_sketches( merged_sketch, self.sketches[platform], operation='union' ) return merged_sketch def get_overlap(self, platform1, platform2): """Calculate audience overlap between two platforms""" return merge_hyperreal_sketches( self.sketches[platform1], self.sketches[platform2], operation='intersection' ) def get_incremental_reach(self, base_platforms, new_platform): """Calculate incremental reach of adding a new platform""" base_reach = self.get_total_reach(base_platforms) total_reach = self.get_total_reach(base_platforms + [new_platform]) return total_reach - base_reach def analyze_platform_mix(self): """Complete analysis of platform combinations""" platforms = list(self.sketches.keys()) results = {} # Individual platform reach for platform in platforms: results[platform] = self.sketches[platform].get_cardinality_estimate() # Pairwise overlaps for i, p1 in enumerate(platforms): for p2 in platforms[i+1:]: overlap_key = f"{p1}_∩_{p2}" results[overlap_key] = self.get_overlap(p1, p2) # Total reach results['total_reach'] = self.get_total_reach() # Platform-only audiences (exclusive reach) for platform in platforms: others = [p for p in platforms if p != platform] if others: others_reach = self.get_total_reach(others) platform_only = results[platform] - self.get_overlap(platform, others[0]) results[f"{platform}_only"] = max(0, platform_only) return results # Usage example analytics = SketchAnalytics() analytics.add_platform_sketch('google', google_sketch) analytics.add_platform_sketch('facebook', facebook_sketch) analytics.add_platform_sketch('tv', tv_sketch) # Get comprehensive analysis results = analytics.analyze_platform_mix() print(f"Total Reach: {results['total_reach']:,}") print(f"Google Only: {results['google_only']:,}") print(f"Facebook ∩ TV: {results['facebook_∩_tv']:,}")

Performance and Accuracy

🎯 Key Properties

  • Associative: (A ∪ B) ∪ C = A ∪ (B ∪ C)
  • Commutative: A ∪ B = B ∪ A
  • Idempotent: A ∪ A = A
  • Fast: O(m) time complexity where m = number of buckets
  • Memory Efficient: No additional memory needed

Accuracy Considerations:

  • Union Accuracy: Generally very good, slight overestimation
  • Intersection Accuracy: Less accurate, especially for small overlaps
  • Error Propagation: Errors compound with multiple operations
  • Sketch Size Impact: Larger sketches (higher k) improve accuracy
Next: Simulations & Results →