Panel to HyperReal

Converting TV Panel Data to HyperReal Sketches

CardinalityKit Documentation

The Panel Conversion Challenge

Traditional TV audience measurement relies on panels: small, representative samples of the population. Converting panel data to HyperReal sketches enables integration with digital measurement while preserving privacy and enabling cross-platform analytics.

🎯 Key Challenge

How do we convert a small panel (e.g., 1,000 households representing 100 million people) into a HyperReal sketch that accurately represents the full population while maintaining demographic distributions?

Virtual People Concept

Panel Expansion Process:

Panel (1,000 people) → Virtual People (100,000 people) → HyperReal Sketch (64KB memory)

Each panelist represents multiple "virtual people" in the full population. The challenge is associating virtual people with panelists while maintaining proper demographic distributions.
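The weight-to-virtual-people relationship can be sketched as follows. The function name `expansion_counts` and the sample weights are illustrative, not part of CardinalityKit:

```python
def expansion_counts(panelists, universe_size):
    """Distribute a virtual-person universe across panelists in proportion to weight.

    panelists: list of (id, weight) tuples; universe_size: total virtual people.
    """
    total_weight = sum(w for _, w in panelists)
    return {pid: round(universe_size * w / total_weight) for pid, w in panelists}

# A panelist with twice the weight stands in for twice as many virtual people
panel = [("P1", 2.0), ("P2", 1.0), ("P3", 1.0)]
counts = expansion_counts(panel, universe_size=100_000)
print(counts)  # {'P1': 50000, 'P2': 25000, 'P3': 25000}
```

In practice the association methods below do not pre-compute these counts directly; they reach the same proportions implicitly by weighting each panelist's affinity scores.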

Affinity Hashing Foundation

Affinity Hashing Function:

h'(h(p), q) → [0,1]

Maps virtual person p's hash and panelist index q to affinity score

```python
import hashlib

import numpy as np


def bit_accuracy(u1, u2):
    """Calculate bit-level similarity between two hash values."""
    bin_u1 = bin(u1)[2:].zfill(32)[:8]  # First 8 bits, zero-padded
    bin_u2 = bin(u2)[2:].zfill(32)[:8]
    # Proportion of matching bits (divide by the 8 bits compared)
    return np.sum([b1 == b2 for b1, b2 in zip(bin_u1, bin_u2)]) / 8


def affinity_hashing(virtual_person, panelist):
    """Calculate affinity between a virtual person and a panelist."""
    vp_hash = int(hashlib.sha256(str(virtual_person).encode()).hexdigest()[:8], 16)
    pan_hash = int(hashlib.sha256(str(panelist).encode()).hexdigest()[:8], 16)
    return bit_accuracy(vp_hash, pan_hash)
```

🔬 Why Bit Accuracy Works

Bit accuracy provides a uniform distribution of affinities between virtual people and panelists. This ensures that each panelist gets a representative sample of virtual people, maintaining demographic balance.
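A quick sanity check of that claim (this standalone snippet re-derives `bit_accuracy` over random 32-bit values; it is a simulation, not CardinalityKit code): with independent random hashes each bit matches with probability 1/2, so the affinity score is a scaled Binomial(8, 1/2), symmetric around 0.5 — every panelist sees the same score distribution.

```python
import random


def bit_accuracy(u1, u2, n_bits=8):
    """Proportion of matching bits in the first n_bits of two hash values."""
    b1 = bin(u1)[2:].zfill(32)[:n_bits]
    b2 = bin(u2)[2:].zfill(32)[:n_bits]
    return sum(x == y for x, y in zip(b1, b2)) / n_bits


random.seed(42)
scores = [bit_accuracy(random.getrandbits(32), random.getrandbits(32))
          for _ in range(10_000)]

mean = sum(scores) / len(scores)
print(mean)  # ≈ 0.5: independent bits match half the time on average
```

Because no panelist's hash is systematically "closer" to the virtual-person hash space, assignments spread evenly before weighting is applied.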

Association Methods: Naive vs. Fast

🟡 Naive Association

Approach:

  • Generate all virtual people upfront
  • Assign each to best-matching panelist
  • Create HyperReal sketch from assignments

Pros:

  • Simple to understand
  • Accurate demographic preservation
  • Complete virtual population

Cons:

  • High memory usage
  • Slow processing (~1 hour)
  • Not scalable

🟢 Fast Association

Approach:

  • Generate virtual people on-demand
  • Keep only top D per panelist
  • Fill remaining buckets probabilistically

Pros:

  • 4x faster processing (~15 min)
  • Lower memory usage
  • Scalable to larger populations

Cons:

  • More complex implementation
  • Requires parameter tuning (D)

Naive Association Implementation

```python
# Uses hashlib, numpy (np), tqdm, and bit_accuracy as defined above


class ExtendedHyperRealSketchFromPanel:
    def __init__(self, b_m, panelists):
        self.b_m = b_m
        self.panelists = panelists  # [(id, weight, attribute), ...]
        self.m = 2 ** b_m
        self.registers = [1.0] * self.m
        self.frequency_counts = [0] * self.m
        self.attribute_samples = [None] * self.m

    def NaiveAssociate(self, sum_weights=100000):
        """Naive association: generate every virtual person and assign each to a panelist."""
        # One virtual-people container per panelist
        virtual_people = {panelist_id: [] for panelist_id, _, _ in self.panelists}

        # Generate and assign all virtual people
        for i in tqdm(range(sum_weights)):
            virtual_person = f"VirtualPerson_{i}"
            vp_hash = int(hashlib.sha256(virtual_person.encode()).hexdigest()[:8], 16)

            # Score each panelist: weighted inverse-log affinity, lower is better
            affinities = []
            for panelist_id, weight, attr in self.panelists:
                affinity = bit_accuracy(
                    vp_hash,
                    int(hashlib.sha256(str(panelist_id).encode()).hexdigest()[:8], 16),
                )
                affinities.append(-np.log(affinity + 1e-6) / weight)

            # Assign to the best-scoring panelist
            best_panelist_idx = affinities.index(min(affinities))
            panelist_id, weight, attr = self.panelists[best_panelist_idx]
            virtual_people[panelist_id].append((virtual_person, attr))

        # Build the HyperReal sketch from the assigned virtual people
        for panelist_id, vp_list in virtual_people.items():
            for virtual_person, user_attr in vp_list:
                self._add_to_sketch(virtual_person, user_attr)

    def _add_to_sketch(self, user_id, user_attr):
        """Add a virtual person to the HyperReal sketch."""
        x = self._hash_function(user_id)   # Binary hash string
        j = int(str(x)[:self.b_m], 2)      # Bucket index from the first b_m bits

        # Normalize a second hash to [0, 1]
        int_val = int(hashlib.sha256(str(user_id).encode()).hexdigest()[:8], 16)
        w = int_val / (16 ** 8 - 1)

        # Keep the minimum value per bucket; track its frequency and a sampled attribute
        if self.registers[j] > w:
            self.frequency_counts[j] = 1
            self.attribute_samples[j] = user_attr
        elif self.registers[j] == w:
            self.frequency_counts[j] += 1
            if self.attribute_samples[j] != user_attr:
                self.attribute_samples[j] = user_attr
        self.registers[j] = min(self.registers[j], w)
```

Fast Association Implementation

```python
# Method of ExtendedHyperRealSketchFromPanel; generate_random_string and
# _fill_remaining_buckets are helpers not shown here.
def FastAssociate(self, sum_weights=100000, D=15):
    """Fast association: keep only the top D virtual people per panelist."""
    # Seed each panelist with D-1 placeholder entries (weight 1 sorts last)
    virtual_people = {
        panelist_id: [('panelist_fake', 'Attribute_fake', 1) for _ in range(D - 1)]
        for panelist_id, _, _ in self.panelists
    }

    # Stream candidate virtual people
    for i in tqdm(range(sum_weights * D)):
        virtual_person = f"VirtualPerson_{i}_{generate_random_string(10)}"
        vp_hash = int(hashlib.sha256(virtual_person.encode()).hexdigest()[:8], 16)

        # Score each panelist (lower is better)
        affinities = []
        for panelist_id, weight, attr in self.panelists:
            affinity = bit_accuracy(
                vp_hash,
                int(hashlib.sha256(str(panelist_id).encode()).hexdigest()[:8], 16),
            )
            affinities.append(-np.log(affinity + 1e-6) / weight)

        best_idx = affinities.index(min(affinities))
        user_id, weight, user_attr = self.panelists[best_idx]

        # Insert into the panelist's sorted list, then keep only the top D
        hashed_w = vp_hash / (16 ** 8 - 1)
        new_panelist = []
        flag_inserted = False
        for vp, attr, w in virtual_people[user_id]:
            if not flag_inserted and hashed_w < w:
                new_panelist.append((virtual_person, user_attr, hashed_w))
                flag_inserted = True
            new_panelist.append((vp, attr, w))
        virtual_people[user_id] = new_panelist[:D]  # Drop the largest entry

    # Fill remaining sketch buckets probabilistically
    self._fill_remaining_buckets(virtual_people, sum_weights)

    # Build the sketch from the surviving (real) virtual people
    for panelist_id, vp_list in virtual_people.items():
        for virtual_person, user_attr, _ in vp_list:
            if virtual_person != 'panelist_fake':
                self._add_to_sketch(virtual_person, user_attr)
```

Performance Comparison

| Method | Panel Size | Universe Size | Processing Time | Memory Usage | Cardinality Error | Demographic Error |
|---|---|---|---|---|---|---|
| Naive Association | 0.75% | 100,000 | ~1 hour | ~500MB | -0.52% | +14.25% |
| Fast Association | 0.75% | 100,000 | ~15 minutes | ~64KB | -0.48% | +12.8% |
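The ~64KB figure is consistent with the b_m = 14 sketches used throughout this page, assuming 4-byte registers (the register width is an assumption here, not stated above):

```python
# Back-of-the-envelope check of the fast method's fixed memory footprint
b_m = 14
num_registers = 2 ** b_m            # 16,384 buckets
bytes_per_register = 4              # e.g. a 32-bit float per register (assumed)
total_kb = num_registers * bytes_per_register / 1024
print(total_kb)  # 64.0
```

Because the sketch size depends only on b_m, it stays constant whether the universe is 100 thousand or 100 million people.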

🎯 Key Findings

  • Speed: Fast association is 4x faster than naive
  • Memory: Fast association uses 8000x less memory
  • Accuracy: Minimal difference in cardinality estimation
  • Demographics: Both methods preserve demographic distributions well

Parameter Optimization

```python
import time

# Assumes ExtendedHyperRealSketchFromPanel, np, and a panelists list are in scope


def optimize_D_parameter():
    """Find the optimal D parameter for fast association."""
    results = {}
    for D in range(5, 25, 2):  # Test different D values
        errors = []
        times = []
        for trial in range(5):  # Multiple trials per D
            start_time = time.time()
            sketch = ExtendedHyperRealSketchFromPanel(b_m=14, panelists=panelists)
            sketch.FastAssociate(sum_weights=100000, D=D)
            processing_time = time.time() - start_time

            cardinality_error = abs(sketch.get_cardinality_estimate() - 100000) / 100000
            errors.append(cardinality_error)
            times.append(processing_time)

        results[D] = {
            'avg_error': np.mean(errors),
            'avg_time': np.mean(times),
            'std_error': np.std(errors),
        }
    return results

# Optimal D found to be around 15 for most scenarios
```

D Parameter Impact:

  • D=5: Fast but higher error (~2%)
  • D=15: Optimal balance (0.5% error, 15min)
  • D=25: Slower with minimal accuracy gain

Real-World Application

```python
# Example: TV panel to digital integration
# load_tv_panel_data, load_digital_sketch, merge_sketches, and tv_universe_size
# are assumed to be provided by the surrounding application.


def integrate_tv_digital_measurement():
    """Integrate TV panel data with digital measurement."""
    # Load TV panel data (small representative sample)
    tv_panel = load_tv_panel_data()

    # Convert the panel to a HyperReal sketch
    tv_sketch = ExtendedHyperRealSketchFromPanel(b_m=14, panelists=tv_panel)
    tv_sketch.FastAssociate(sum_weights=tv_universe_size)

    # Load digital measurement sketches
    google_sketch = load_digital_sketch('google')
    facebook_sketch = load_digital_sketch('facebook')

    # Merge all sketches for total deduplicated reach
    total_sketch = merge_sketches([tv_sketch, google_sketch, facebook_sketch])
    total_reach = total_sketch.get_cardinality_estimate()

    # Demographic breakdowns
    demographics = total_sketch.get_frequency_for_attr()

    return {
        'total_reach': total_reach,
        'tv_only': tv_sketch.get_cardinality_estimate(),
        'digital_only': merge_sketches([google_sketch, facebook_sketch]).get_cardinality_estimate(),
        'demographics': demographics,
    }
```

Advantages and Limitations

✅ Advantages

  • Privacy Preserving: No individual-level data exposed
  • Scalable: Works with any panel size
  • Mergeable: Can combine with digital sketches
  • Demographic Aware: Preserves attribute distributions
  • Memory Efficient: Fixed sketch size regardless of universe

⚠️ Limitations

  • Panel Quality: Results depend on representative panel
  • Hash Consistency: Requires same hashing across platforms
  • Demographic Conflicts: Simple resolution strategies
  • Parameter Tuning: D parameter needs optimization
  • Approximation: Inherent estimation errors