Sketch Operations - CardinalityKit Documentation

Why Sketch Operations Matter

One of the most powerful features of cardinality estimation sketches is their ability to be merged. This enables cross-platform analytics, audience deduplication, and complex set operations without exposing individual user data.

                🎯 Key Applications
                Cross-Platform Reach: Total unique users across Google + Facebook + TV
Audience Overlap: Users who appear on both platforms
Incremental Reach: New users gained by adding a platform
Geographic Analysis: Audience distribution across regions

            

Set Theory Foundation

Set Operations:

Union: A ∪ B = {x | x ∈ A or x ∈ B}

Intersection: A ∩ B = {x | x ∈ A and x ∈ B}

Difference: A - B = {x | x ∈ A and x ∉ B}

Example: Platform Analytics

Platform A: 1M unique users
Platform B: 800K unique users
Union (Total Reach): 1.5M unique users
Intersection (Overlap): 300K unique users
Platform A Only: 700K unique users
Platform B Only: 500K unique users

HyperLogLog Sketch Operations

🟢 Union Operation (MAX)

To find total unique elements across sketches, take the maximum rank value for each bucket.

Logic:

Higher rank indicates rarer pattern → more unique elements in that bucket

🔴 Intersection Operation (MIN)

To find overlapping elements, take the minimum rank value for each bucket.

Logic:

Lower rank indicates common pattern → elements present in both sketches

def merge_hll_sketches(sketch1, sketch2, operation='union'):
    """Merge two HyperLogLog sketches"""
    
    merged_buckets = []
    
    for i in range(len(sketch1.buckets)):
        if operation == 'union':
            # Union: take maximum rank (more unique elements)
            merged_buckets.append(max(sketch1.buckets[i], sketch2.buckets[i]))
        elif operation == 'intersection':
            # Intersection: take minimum rank (common elements)
            merged_buckets.append(min(sketch1.buckets[i], sketch2.buckets[i]))
    
    # Calculate cardinality from merged buckets
    total = sum(2 ** (-1 * bucket) for bucket in merged_buckets)
    harmonic_mean = total ** -1
    m = len(merged_buckets)
    
    estimate = (m ** 2) * harmonic_mean
    
    # Apply bias correction
    if len(sketch1.buckets) <= 16:
        BIAS = 0.673
    elif len(sketch1.buckets) == 32:
        BIAS = 0.697
    else:
        BIAS = 0.7213 / (1 + (1.079 / m))
    
    return BIAS * estimate
            

HyperLogLog Bucket Merging Example:

6

1

0

11

8

Sketch A

11

1

5

3

10

Sketch B

MAX (Union)

11

1

5

11

10

Union Result

HyperReal Sketch Operations

🟢 Union Operation (MIN)

To find total unique elements, take the minimum hash value for each bucket.

Logic:

Smaller minimum indicates more elements hashed to that bucket

🔴 Intersection Operation (MAX)

To find overlapping elements, take the maximum hash value for each bucket.

Logic:

Larger minimum indicates fewer common elements in that bucket

def merge_hyperreal_sketches(sketch1, sketch2, operation='union'):
    """Merge two HyperReal sketches"""
    
    merged_registers = []
    
    for i in range(len(sketch1.registers)):
        if operation == 'union':
            # Union: take minimum value (more elements)
            merged_registers.append(min(sketch1.registers[i], sketch2.registers[i]))
        elif operation == 'intersection':
            # Intersection: take maximum value (fewer common elements)
            merged_registers.append(max(sketch1.registers[i], sketch2.registers[i]))
    
    # Calculate cardinality (no bias correction needed for HyperReal)
    m = len(merged_registers)
    estimate = (m ** 2) / sum(merged_registers)
    
    return estimate
            

HyperReal Bucket Merging Example:

0.6

0.1

0.8

0.5

0.7

Sketch A

0.11

0.5

0.10

0.3

0.1

Sketch B

MIN (Union)

0.11

0.1

0.10

0.3

0.1

Union Result

Extended Sketch Operations

Extended sketches (with demographic information) require special handling during merging:

def merge_extended_sketches(sketch1, sketch2, operation='union'):
    """Merge extended sketches with demographic preservation"""
    
    merged_sketch = ExtendedHyperRealSketch(sketch1.b_m, sketch1.b_s)
    
    for i in range(sketch1.m):
        if operation == 'union':
            # Take sketch with smaller register value (more elements)
            if sketch1.registers[i] <= sketch2.registers[i]:
                merged_sketch.registers[i] = sketch1.registers[i]
                merged_sketch.frequency_counts[i] = sketch1.frequency_counts[i]
                merged_sketch.attribute_samples[i] = sketch1.attribute_samples[i]
            else:
                merged_sketch.registers[i] = sketch2.registers[i]
                merged_sketch.frequency_counts[i] = sketch2.frequency_counts[i]
                merged_sketch.attribute_samples[i] = sketch2.attribute_samples[i]
            
            # Handle equal values (rare with floats)
            if sketch1.registers[i] == sketch2.registers[i]:
                merged_sketch.frequency_counts[i] = (
                    sketch1.frequency_counts[i] + sketch2.frequency_counts[i]
                )
                
                # Demographic conflict resolution
                if sketch1.attribute_samples[i] != sketch2.attribute_samples[i]:
                    # Weighted selection based on frequency
                    if sketch1.frequency_counts[i] >= sketch2.frequency_counts[i]:
                        merged_sketch.attribute_samples[i] = sketch1.attribute_samples[i]
                    else:
                        merged_sketch.attribute_samples[i] = sketch2.attribute_samples[i]
        
        elif operation == 'intersection':
            # Take sketch with larger register value (fewer common elements)
            if sketch1.registers[i] >= sketch2.registers[i]:
                merged_sketch.registers[i] = sketch1.registers[i]
                merged_sketch.frequency_counts[i] = min(
                    sketch1.frequency_counts[i], sketch2.frequency_counts[i]
                )
                merged_sketch.attribute_samples[i] = sketch1.attribute_samples[i]
            else:
                merged_sketch.registers[i] = sketch2.registers[i]
                merged_sketch.frequency_counts[i] = min(
                    sketch1.frequency_counts[i], sketch2.frequency_counts[i]
                )
                merged_sketch.attribute_samples[i] = sketch2.attribute_samples[i]
    
    return merged_sketch
            

Real-World Use Cases

🌐 Cross-Platform Reach

Measure total unique audience across Google, Facebook, and TV without double-counting

📊 Audience Overlap

Find users who engage with multiple platforms for targeted campaigns

📈 Incremental Analysis

Measure new reach gained by adding additional advertising channels

🎯 Campaign Attribution

Understand which combinations of channels drive unique conversions

🌍 Geographic Insights

Analyze audience distribution and overlap across different regions

📱 Device Analysis

Cross-device user tracking while preserving privacy

Complete Implementation Example

class SketchAnalytics:
    """Complete sketch operations for cross-platform analytics"""
    
    def __init__(self):
        self.sketches = {}
    
    def add_platform_sketch(self, platform_name, sketch):
        """Add a platform sketch to the analytics"""
        self.sketches[platform_name] = sketch
    
    def get_total_reach(self, platforms=None):
        """Calculate total unique reach across specified platforms"""
        if platforms is None:
            platforms = list(self.sketches.keys())
        
        if len(platforms) == 1:
            return self.sketches[platforms[0]].get_cardinality_estimate()
        
        # Start with first platform
        merged_sketch = self.sketches[platforms[0]]
        
        # Union with remaining platforms
        for platform in platforms[1:]:
            merged_sketch = merge_hyperreal_sketches(
                merged_sketch, 
                self.sketches[platform], 
                operation='union'
            )
        
        return merged_sketch
    
    def get_overlap(self, platform1, platform2):
        """Calculate audience overlap between two platforms"""
        return merge_hyperreal_sketches(
            self.sketches[platform1],
            self.sketches[platform2],
            operation='intersection'
        )
    
    def get_incremental_reach(self, base_platforms, new_platform):
        """Calculate incremental reach of adding a new platform"""
        base_reach = self.get_total_reach(base_platforms)
        total_reach = self.get_total_reach(base_platforms + [new_platform])
        
        return total_reach - base_reach
    
    def analyze_platform_mix(self):
        """Complete analysis of platform combinations"""
        platforms = list(self.sketches.keys())
        results = {}
        
        # Individual platform reach
        for platform in platforms:
            results[platform] = self.sketches[platform].get_cardinality_estimate()
        
        # Pairwise overlaps
        for i, p1 in enumerate(platforms):
            for p2 in platforms[i+1:]:
                overlap_key = f"{p1}_∩_{p2}"
                results[overlap_key] = self.get_overlap(p1, p2)
        
        # Total reach
        results['total_reach'] = self.get_total_reach()
        
        # Platform-only audiences (exclusive reach)
        for platform in platforms:
            others = [p for p in platforms if p != platform]
            if others:
                others_reach = self.get_total_reach(others)
                platform_only = results[platform] - self.get_overlap(platform, others[0])
                results[f"{platform}_only"] = max(0, platform_only)
        
        return results

# Usage example
analytics = SketchAnalytics()
analytics.add_platform_sketch('google', google_sketch)
analytics.add_platform_sketch('facebook', facebook_sketch)
analytics.add_platform_sketch('tv', tv_sketch)

# Get comprehensive analysis
results = analytics.analyze_platform_mix()
print(f"Total Reach: {results['total_reach']:,}")
print(f"Google Only: {results['google_only']:,}")
print(f"Facebook ∩ TV: {results['facebook_∩_tv']:,}")
            

Performance and Accuracy

                🎯 Key Properties
                Associative: (A ∪ B) ∪ C = A ∪ (B ∪ C)
Commutative: A ∪ B = B ∪ A
Idempotent: A ∪ A = A
Fast: O(m) time complexity where m = number of buckets
Memory Efficient: No additional memory needed

            

Accuracy Considerations:

Union Accuracy: Generally very good, slight overestimation
Intersection Accuracy: Less accurate, especially for small overlaps
Error Propagation: Errors compound with multiple operations
Sketch Size Impact: Larger sketches (higher k) improve accuracy