The Panel Conversion Challenge
Traditional TV audience measurement relies on panels: small, representative samples of the population. Converting panel data into HyperReal sketches allows it to be combined with digital measurement while preserving privacy and enabling cross-platform analytics.
🎯 Key Challenge
How do we convert a small panel (e.g., 1,000 households representing 100 million people) into a HyperReal sketch that accurately represents the full population while maintaining demographic distributions?
Virtual People Concept
Panel Expansion Process:
Panel (1,000 people) → Virtual People (100,000 people) → HyperReal Sketch (64KB memory)
Each panelist represents multiple "virtual people" in the full population. The challenge is associating virtual people with panelists while maintaining the proper demographic distributions.
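As a minimal illustration of the expansion step (with made-up household weights, not real panel data), each panelist can be allocated a share of the virtual population proportional to its weight:

```python
def expand_panel(panel_weights, universe_size):
    """Allocate virtual-person counts proportional to panel weights."""
    total = sum(panel_weights.values())
    # Note: rounding means counts may not sum exactly to universe_size in general
    return {p: round(universe_size * w / total)
            for p, w in panel_weights.items()}

# Hypothetical panel of three households with weights 2 : 3 : 5
counts = expand_panel({"HH_1": 2, "HH_2": 3, "HH_3": 5}, 100000)
```

The association methods below refine this idea: rather than assigning counts directly, they decide *which* virtual people belong to each panelist.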
Affinity Hashing Foundation
import hashlib
import numpy as np

def bit_accuracy(u1, u2):
    """Calculate bit-level similarity between two 32-bit hash values"""
    bin_u1 = bin(u1)[2:].zfill(32)[:8]  # first 8 bits, zero-padded to a fixed width
    bin_u2 = bin(u2)[2:].zfill(32)[:8]
    # Proportion of matching bits (8 bits compared)
    return np.sum([b1 == b2 for b1, b2 in zip(bin_u1, bin_u2)]) / 8
def affinity_hashing(virtual_person, panelist):
    """Calculate affinity between a virtual person and a panelist"""
    # The first 8 hex chars of SHA-256 give a 32-bit integer hash
    vp_hash = int(hashlib.sha256(str(virtual_person).encode()).hexdigest()[:8], 16)
    pan_hash = int(hashlib.sha256(str(panelist).encode()).hexdigest()[:8], 16)
    return bit_accuracy(vp_hash, pan_hash)
🔬 Why Bit Accuracy Works
Bit accuracy provides a uniform distribution of affinities between virtual people and panelists. This ensures that each panelist gets a representative sample of virtual people, maintaining demographic balance.
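A quick self-contained check (using a zero-padded variant of `bit_accuracy` and hypothetical IDs) hashes many random virtual-person IDs against a single panelist. Since each compared bit matches with probability 1/2, the mean affinity should sit near 0.5, with values spread across the range:

```python
import hashlib

def bit_accuracy(u1, u2, n_bits=8):
    """Proportion of matching leading bits between two 32-bit hashes."""
    b1 = bin(u1)[2:].zfill(32)[:n_bits]
    b2 = bin(u2)[2:].zfill(32)[:n_bits]
    return sum(x == y for x, y in zip(b1, b2)) / n_bits

def hash32(s):
    """32-bit integer from the first 8 hex chars of SHA-256."""
    return int(hashlib.sha256(s.encode()).hexdigest()[:8], 16)

pan_hash = hash32("Panelist_42")
affinities = [bit_accuracy(hash32(f"VP_{i}"), pan_hash) for i in range(10000)]
mean_affinity = sum(affinities) / len(affinities)
# Expect mean_affinity close to 0.5 for a well-behaved hash
```

This is a sanity check on the hash behavior, not a proof of demographic balance; the latter depends on the weighting scheme in the association step.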
Method 1: Naive Association
🟡 Naive Association
Approach:
- Generate all virtual people upfront
- Assign each to best-matching panelist
- Create HyperReal sketch from assignments
Pros:
- Simple to understand
- Accurate demographic preservation
- Complete virtual population
Cons:
- High memory usage
- Slow processing (~1 hour)
- Not scalable
🟢 Fast Association
Approach:
- Generate virtual people on-demand
- Keep only top D per panelist
- Fill remaining buckets probabilistically
Pros:
- 4x faster processing (~15 min)
- Lower memory usage
- Scalable to larger populations
Cons:
- More complex implementation
- Requires parameter tuning (D)
Naive Association Implementation
class ExtendedHyperRealSketchFromPanel:
    def __init__(self, b_m, panelists):
        self.b_m = b_m
        self.panelists = panelists              # [(id, weight, attribute), ...]
        self.m = 2 ** b_m                       # number of buckets
        self.registers = [1.0] * self.m         # minimum normalized hash per bucket
        self.frequency_counts = [0] * self.m    # multiplicity of the minimum
        self.attribute_samples = [None] * self.m  # sampled attribute per bucket
def NaiveAssociate(self, sum_weights=100000):
"""Naive association method - assign all virtual people to panelists"""
# Create virtual people containers for each panelist
virtual_people = {panelist_id: [] for panelist_id, _, _ in self.panelists}
# Generate and assign all virtual people
for i in tqdm(range(sum_weights)):
virtual_person = f"VirtualPerson_{i}"
# Calculate affinity to each panelist
vp_hash = int(hashlib.sha256(virtual_person.encode()).hexdigest()[:8], 16)
affinities = []
for panelist_id, weight, attr in self.panelists:
affinity = bit_accuracy(vp_hash,
int(hashlib.sha256(str(panelist_id).encode()).hexdigest()[:8], 16))
            # Lower score = stronger match; epsilon avoids log(0)
            score = -np.log(affinity + 1e-6) / weight
affinities.append(score)
# Assign to best panelist
best_panelist_idx = affinities.index(min(affinities))
panelist_id, weight, attr = self.panelists[best_panelist_idx]
virtual_people[panelist_id].append((virtual_person, attr))
# Create HyperReal sketch from virtual people
for panelist_id, vp_list in virtual_people.items():
for virtual_person, user_attr in vp_list:
self._add_to_sketch(virtual_person, user_attr)
    def _add_to_sketch(self, user_id, user_attr):
        """Add virtual person to HyperReal sketch"""
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        # Bucket index from hash bits disjoint from those used for the value below
        bits = bin(int(digest[8:16], 16))[2:].zfill(32)
        j = int(bits[:self.b_m], 2)
        # Normalize the first 8 hex chars to [0, 1] (matches hashed_w in FastAssociate)
        w = int(digest[:8], 16) / (16 ** 8 - 1)
        # Track the minimum value per bucket, its frequency, and a sampled attribute
        if w < self.registers[j]:
            self.registers[j] = w
            self.frequency_counts[j] = 1
            self.attribute_samples[j] = user_attr
        elif w == self.registers[j]:
            self.frequency_counts[j] += 1
Fast Association Implementation
def FastAssociate(self, sum_weights=100000, D=15):
    """Fast association method - keep only the top D virtual people per panelist"""
    # Initialize each panelist's sorted list with D sentinel entries;
    # the sentinel hash value 1.0 sorts after any real normalized hash
    virtual_people = {
        panelist_id: [('panelist_fake', 'Attribute_fake', 1.0) for _ in range(D)]
        for panelist_id, _, _ in self.panelists
    }
    # Generate candidate virtual people
    # (generate_random_string is a helper assumed to return a random suffix)
    for i in tqdm(range(sum_weights * D)):
        virtual_person = f"VirtualPerson_{i}_{generate_random_string(10)}"
        # Calculate hash and find the best panelist
        vp_hash = int(hashlib.sha256(virtual_person.encode()).hexdigest()[:8], 16)
        affinities = []
        for panelist_id, weight, attr in self.panelists:
            affinity = bit_accuracy(vp_hash,
                int(hashlib.sha256(str(panelist_id).encode()).hexdigest()[:8], 16))
            # Lower score = stronger match; epsilon avoids log(0)
            score = -np.log(affinity + 1e-6) / weight
            affinities.append(score)
        best_idx = affinities.index(min(affinities))
        user_id, weight, user_attr = self.panelists[best_idx]
        # Insert into the sorted list, then truncate back to the top D
        hashed_w = vp_hash / (16 ** 8 - 1)
        new_list = []
        inserted = False
        for vp, attr, w in virtual_people[user_id]:
            if not inserted and hashed_w < w:
                new_list.append((virtual_person, user_attr, hashed_w))
                inserted = True
            new_list.append((vp, attr, w))
        virtual_people[user_id] = new_list[:D]
    # Fill remaining sketch buckets probabilistically (implementation not shown)
    self._fill_remaining_buckets(virtual_people, sum_weights)
    # Create sketch from the top virtual people, skipping sentinels
    for panelist_id, vp_list in virtual_people.items():
        for virtual_person, user_attr, _ in vp_list:
            if virtual_person != 'panelist_fake':
                self._add_to_sketch(virtual_person, user_attr)
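The linear-scan insertion above rebuilds the list on every candidate. The same "keep only the top D" step can be done with a bounded max-heap; `keep_top_d` is an illustrative helper, not part of the original code:

```python
import heapq

def keep_top_d(values, D):
    """Keep the D smallest values seen so far, using a max-heap of negated values."""
    heap = []
    for v in values:
        if len(heap) < D:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            heapq.heapreplace(heap, -v)  # evict the current largest kept value
    return sorted(-x for x in heap)

top3 = keep_top_d([0.9, 0.1, 0.5, 0.3, 0.7], 3)  # → [0.1, 0.3, 0.5]
```

Each insertion is O(log D) instead of O(D), which matters when sum_weights * D candidates are processed.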
Performance Comparison
🎯 Key Findings
- Speed: Fast association is 4x faster than naive
- Memory: Fast association uses 8000x less memory
- Accuracy: Minimal difference in cardinality estimation
- Demographics: Both methods preserve demographic distributions well
Parameter Optimization
def optimize_D_parameter():
    """Find the optimal D parameter for fast association"""
    # Assumes `panelists` is defined in scope and `time` is imported
    results = {}
for D in range(5, 25, 2): # Test different D values
errors = []
times = []
for trial in range(5): # Multiple trials
start_time = time.time()
sketch = ExtendedHyperRealSketchFromPanel(b_m=14, panelists=panelists)
sketch.FastAssociate(sum_weights=100000, D=D)
processing_time = time.time() - start_time
cardinality_error = abs(sketch.get_cardinality_estimate() - 100000) / 100000
errors.append(cardinality_error)
times.append(processing_time)
results[D] = {
'avg_error': np.mean(errors),
'avg_time': np.mean(times),
'std_error': np.std(errors)
}
return results
# Optimal D found to be around 15 for most scenarios
D Parameter Impact:
- D=5: Fast but higher error (~2%)
- D=15: Optimal balance (0.5% error, 15min)
- D=25: Slower with minimal accuracy gain
Real-World Application
# Example: TV Panel to Digital Integration
def integrate_tv_digital_measurement():
"""Integrate TV panel data with digital measurement"""
# Load TV panel data
tv_panel = load_tv_panel_data() # Small representative sample
# Convert to HyperReal sketch
tv_sketch = ExtendedHyperRealSketchFromPanel(b_m=14, panelists=tv_panel)
tv_sketch.FastAssociate(sum_weights=tv_universe_size)
# Load digital measurement sketches
google_sketch = load_digital_sketch('google')
facebook_sketch = load_digital_sketch('facebook')
# Merge all sketches for total reach
total_sketch = merge_sketches([tv_sketch, google_sketch, facebook_sketch])
# Calculate deduplicated audience
total_reach = total_sketch.get_cardinality_estimate()
# Get demographic breakdowns
demographics = total_sketch.get_frequency_for_attr()
return {
'total_reach': total_reach,
'tv_only': tv_sketch.get_cardinality_estimate(),
'digital_only': merge_sketches([google_sketch, facebook_sketch]).get_cardinality_estimate(),
'demographics': demographics
}
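`merge_sketches` above is assumed to be defined elsewhere. For min-register sketches like HyperReal, the reach-relevant part of a merge reduces to an element-wise minimum over registers; a minimal sketch of that idea (frequency counts and attribute samples need extra care and are omitted here):

```python
def merge_registers(register_lists):
    """Merge min-register sketches by taking the element-wise minimum."""
    assert len({len(r) for r in register_lists}) == 1, "sketches must share b_m"
    return [min(vals) for vals in zip(*register_lists)]

# Toy registers for two platforms (1.0 = empty bucket)
tv = [0.2, 1.0, 0.5, 1.0]
google = [0.4, 0.3, 1.0, 1.0]
merged = merge_registers([tv, google])  # → [0.2, 0.3, 0.5, 1.0]
```

Because min is associative and commutative, sketches can be merged in any order, which is what makes the TV + digital deduplication above possible without sharing raw IDs.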
Advantages and Limitations
✅ Advantages
- Privacy Preserving: No individual-level data exposed
- Scalable: Works with any panel size
- Mergeable: Can combine with digital sketches
- Demographic Aware: Preserves attribute distributions
- Memory Efficient: Fixed sketch size regardless of universe
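The "fixed sketch size" point can be made concrete with a quick calculation (assuming one 4-byte float per register, which is consistent with the 64KB figure quoted at the top of this section):

```python
b_m = 14
m = 2 ** b_m                           # number of buckets: 16,384
bytes_per_register = 4                 # e.g., one float32 per register
sketch_bytes = m * bytes_per_register  # 65,536 bytes = 64 KB, for any universe size
```

The frequency and attribute arrays add a constant factor, but the total stays independent of how many virtual people are inserted.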
⚠️ Limitations
- Panel Quality: Results depend on representative panel
- Hash Consistency: Requires same hashing across platforms
- Demographic Conflicts: attribute collisions within a bucket are resolved with simple heuristics (e.g., last write wins)
- Parameter Tuning: D parameter needs optimization
- Approximation: Inherent estimation errors
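The hash-consistency limitation can be checked directly: two platforms that hash the same user ID with the same function must land it in the same bucket. A toy check, assuming SHA-256 and the bucketing scheme used in the implementations above:

```python
import hashlib

def bucket(user_id, b_m=14):
    """Bucket index from the leading b_m bits of a 32-bit SHA-256 prefix."""
    bits = bin(int(hashlib.sha256(user_id.encode()).hexdigest()[:8], 16))[2:].zfill(32)
    return int(bits[:b_m], 2)

# The same ID hashed independently on two "platforms" agrees;
# any platform using a different hash or salt would break deduplication
same = bucket("user@example.com") == bucket("user@example.com")
```

In practice this means all participating platforms must agree on the hash function, any salt, and the identifier space before sketches can be merged.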