Package Overview

CardinalityKit is a Python package that organizes a full suite of cardinality estimation algorithms into a clean, professional toolkit. It transforms research implementations into production-ready code with object-oriented interfaces, thorough documentation, and easy-to-use APIs.

Key Features

  • Complete Algorithm Suite: All major cardinality estimation algorithms
  • Extended Functionality: Demographic tracking and attribute analysis
  • Sample Conversion: Transform sample data to HyperReal sketches
  • Professional API: Clean, consistent object-oriented interface
  • Backward Compatibility: Maintains compatibility with original research code

Package Structure

The package is organized into logical modules for easy navigation and usage:

cardinalitykit/
├── algorithms/                    # Core estimation algorithms
│   ├── flajolet_martin.py         # Original FM algorithm (1985)
│   ├── loglog.py                  # LogLog algorithm
│   ├── superloglog.py             # SuperLogLog with outlier elimination
│   ├── hyperloglog.py             # Industry-standard HLL
│   └── hyperreal.py               # Unbiased HyperReal algorithm
├── extended/                      # Extended algorithms with demographics
│   ├── extended_hyperloglog.py    # HLL with attribute tracking
│   └── extended_hyperreal.py      # HyperReal with attribute tracking
├── conversion/                    # Sample data conversion
│   └── sample_to_hyperreal.py     # Sample to HyperReal conversion
├── utils/                         # Utility functions
│   ├── csv_parser.py              # CSV parsing and hash generation
│   └── actual_count.py            # Exact counting for validation
├── examples.py                    # Usage examples
└── README.md                      # Comprehensive documentation

Installation & Setup

The package can be easily integrated into any Python project:

# Add to Python path
import sys
sys.path.append('/path/to/CardinalityKit')

# Import the package
import cardinalitykit
from cardinalitykit import HyperLogLogEstimator, ExtendedHyperLogLogSketch

Dependencies

  • numpy ≥ 1.19.0 - Numerical computations
  • pandas ≥ 1.1.0 - Data manipulation
  • tqdm ≥ 4.50.0 - Progress bars
  • scipy ≥ 1.5.0 - Scientific computing

Core Algorithms

All core algorithms follow a consistent interface pattern for ease of use:

Algorithm         Memory   Accuracy   Use Case
Flajolet-Martin   Low      Basic      Historical reference
LogLog            Medium   Good       General purpose
SuperLogLog       Medium   Better     Outlier elimination
HyperLogLog       Medium   Best       Industry standard
HyperReal         Medium   Unbiased   Research applications

Basic Usage Pattern

from cardinalitykit import HyperLogLogEstimator
import hashlib

# Create estimator
hll = HyperLogLogEstimator(k=10)

# Process data
data = ["user_1", "user_2", "user_3", "user_1"]
for item in data:
    # Zero-pad to a fixed 32-bit binary string so bucket bits stay aligned
    hash_val = '{:032b}'.format(int(hashlib.sha256(item.encode()).hexdigest()[:8], 16))
    hll.update(hash_val)

# Get results
estimate = hll.estimate()
memory_usage = hll.memory_usage()
print(f"Estimated cardinality: {estimate}")
print(f"Memory usage: {memory_usage} bytes")

Extended Algorithms

Extended algorithms add demographic tracking capabilities to standard cardinality estimation:

from cardinalitykit import ExtendedHyperLogLogSketch

# Create extended sketch
ehll = ExtendedHyperLogLogSketch(b_m=8, b_s=8)

# Process events with attributes
events = [
    {'id_to_count': 'user_1', 'attribute': 'group_A'},
    {'id_to_count': 'user_2', 'attribute': 'group_B'},
    {'id_to_count': 'user_3', 'attribute': 'group_A'},
    {'id_to_count': 'user_1', 'attribute': 'group_A'}  # Duplicate
]
for event in events:
    ehll.update_sketch(event)

# Get results
total_estimate = ehll.get_cardinality_estimate()
attribute_frequencies = ehll.get_frequency_for_attr()
print(f"Total cardinality: {total_estimate}")
print(f"Group A frequency: {ehll.get_frequency_for_attr('group_A')}")
print(f"All attributes: {attribute_frequencies}")
Key Features of Extended Algorithms:
  • Track frequency counts per demographic attribute
  • Maintain attribute samples for collision resolution
  • Provide per-attribute cardinality estimates
  • Support both HyperLogLog and HyperReal base algorithms

Sample Conversion

The conversion module enables transformation of sample data (e.g., survey panels) into HyperReal sketches:

from cardinalitykit import ExtendedHyperRealSketchFromSample

# Sample data: (id, weight, attribute)
sample_data = [
    ("sample_1", 0.4, "demographic_A"),
    ("sample_2", 0.3, "demographic_B"),
    ("sample_3", 0.3, "demographic_A")
]

# Create converter
converter = ExtendedHyperRealSketchFromSample(b_m=8, sample_data=sample_data)

# Run conversion (choose one method)
converter.naive_associate(sum_weights=10000)          # Simple method
# converter.fast_associate(sum_weights=10000, D=15)   # Optimized method

# Get results
estimate = converter.get_cardinality_estimate()
attr_frequencies = converter.get_frequency_for_attr()
print(f"Converted cardinality: {estimate}")
for attr, freq in attr_frequencies.items():
    print(f"{attr}: {freq}")

Conversion Methods

  • Naive Associate: Simple one-to-one virtual person assignment
  • Fast Associate: Optimized method with depth parameter for better accuracy

Utility Functions

The package includes comprehensive utilities for data processing and validation:

CSV Processing

from cardinalitykit.utils import CSVParser, parse_csv

# Create parser
parser = CSVParser(row_limit=1000)  # Optional row limit

# Parse specific columns
data = parse_csv('data.csv', columns=[0, 1, 3])

# Generate hashes from a column
hashes = parser.create_hashes_from_column('data.csv', column_index=1)

# Count lines
total_rows = parser.count_lines('data.csv')

Exact Counting for Validation

from cardinalitykit.utils import get_exact_unique_using_set

# Get exact count for comparison
exact_count, memory_used = get_exact_unique_using_set('data.csv', column_index=1)
print(f"Exact unique count: {exact_count}")
print(f"Memory used: {memory_used} bytes")

Complete Example

Here's an end-to-end example comparing all five core algorithms against an exact count:

from cardinalitykit import *
import hashlib

def algorithm_comparison():
    # Create sample data
    data = [f"user_{i}" for i in range(5000)]

    # Initialize algorithms
    algorithms = {
        'Flajolet-Martin': FlajoletMartinEstimator(),
        'LogLog': LogLogEstimator(k=8),
        'SuperLogLog': SuperLogLogEstimator(k=8),
        'HyperLogLog': HyperLogLogEstimator(k=8),
        'HyperReal': HyperRealEstimator(k=8)
    }

    # Process data
    for name, alg in algorithms.items():
        for item in data:
            # Zero-pad to a fixed 32-bit binary string
            hash_val = '{:032b}'.format(int(hashlib.sha256(item.encode()).hexdigest()[:8], 16))
            alg.update(hash_val)

    # Compare results
    actual = len(set(data))
    print(f"Actual unique count: {actual}")
    print("-" * 50)
    for name, alg in algorithms.items():
        estimate = alg.estimate()
        error = abs(estimate - actual) / actual * 100
        memory = alg.memory_usage()
        print(f"{name:15}: {estimate:8.0f} ({error:5.2f}% error, {memory:4d} bytes)")

# Run comparison
algorithm_comparison()

API Reference

Core Algorithm Interface

  • __init__(k) - Initialize with k bits for bucket indexing
  • update(hash_value) - Process a hash value
  • estimate() - Get cardinality estimate
  • memory_usage() - Get memory usage in bytes
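To illustrate this contract, here is a hypothetical toy estimator (not part of the package; all names and constants are illustrative) that implements the four methods with a single register tracking the largest first-1-bit position, in the spirit of the classic probabilistic counting sketches:

```python
class ToyEstimator:
    """Minimal sketch conforming to the core interface (illustrative only)."""

    def __init__(self, k=8):
        self.k = k          # kept for interface parity; unused by this toy
        self.max_rho = 0    # largest observed position of the first 1-bit

    def update(self, hash_value):
        # hash_value is a fixed-width binary string, e.g. '0010110010'
        pos = hash_value.find('1')
        rho = (pos + 1) if pos != -1 else len(hash_value) + 1
        self.max_rho = max(self.max_rho, rho)

    def estimate(self):
        # Crude single-register estimate 2^R; real algorithms average
        # many registers and apply bias correction
        return float(2 ** self.max_rho) if self.max_rho else 0.0

    def memory_usage(self):
        return 4  # one small integer register (assumed 4 bytes)

est = ToyEstimator(k=8)
for h in ('0010110010', '0001011101', '0110001001'):
    est.update(h)
print(est.estimate())  # prints 16.0 (first-1 positions are 3, 4, 2)
```

A production estimator replaces the single register with 2^k registers selected by the first k hash bits, which is what drives the accuracy figures in the table above.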

Extended Algorithm Interface

  • __init__(b_m, b_s) - Initialize with bucket and indicator bits
  • update_sketch(event) - Process event with id and attribute
  • get_cardinality_estimate() - Get total cardinality
  • get_frequency_for_attr(attr) - Get frequency for specific attribute
  • get_sketch() - Get internal data structures

Conversion Algorithm Interface

  • __init__(b_m, sample_data) - Initialize with sample data
  • naive_associate(sum_weights) - Simple conversion method
  • fast_associate(sum_weights, D) - Optimized conversion method
  • get_cardinality_estimate() - Get converted cardinality
  • get_frequency_for_attr(attr) - Get attribute frequencies

Best Practices

Algorithm Selection Guidelines

  • HyperLogLog: Default choice for most applications
  • HyperReal: When unbiased estimates are critical
  • Extended versions: When demographic analysis is needed
  • Sample conversion: For privacy-preserving analytics

Parameter Selection

  • k parameter: Start with k=10 (1024 buckets) for most applications
  • Higher k: Better accuracy but more memory usage
  • Lower k: Less memory but reduced accuracy
  • Extended algorithms: Use b_m = b_s for balanced performance
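As a back-of-the-envelope guide for choosing k, HyperLogLog's relative standard error is approximately 1.04/√m with m = 2^k registers; this quick sketch (independent of the package) previews the accuracy/memory trade-off:

```python
import math

# Expected HLL relative standard error for several register counts
for k in (8, 10, 12, 14):
    m = 2 ** k                      # number of registers
    rel_err = 1.04 / math.sqrt(m)   # classic HLL standard error
    print(f"k={k:2d}  registers={m:5d}  ~{rel_err * 100:.2f}% relative error")
```

For example, the recommended k=10 gives roughly 3.25% standard error, while k=14 brings it under 1% at sixteen times the register count.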

Performance Considerations

  • Hash quality is critical - use strong hash functions
  • Extended algorithms have higher memory overhead
  • Sample conversion methods scale with virtual population size
  • Batch processing improves performance for large datasets
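One way to act on the batch-processing point is to precompute hashes for a whole batch before the update loop. This sketch assumes, as in the usage examples above, that estimators consume fixed-width 32-bit binary hash strings one at a time (the helper name is illustrative):

```python
import hashlib

def hash_batch(items):
    """Precompute fixed-width binary hash strings for a batch of items."""
    return [
        '{:032b}'.format(int(hashlib.sha256(str(i).encode()).hexdigest()[:8], 16))
        for i in items
    ]

hashes = hash_batch(['user_1', 'user_2', 'user_3'])
# Each entry is a 32-character binary string ready for estimator.update()
print(len(hashes), len(hashes[0]))
```

Separating hashing from sketch updates also makes it easy to parallelize the hashing stage, which usually dominates the cost for large datasets.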

Integration Examples

Data Pipeline Integration

import pandas as pd
from cardinalitykit import HyperLogLogEstimator
import hashlib

def process_dataframe(df, id_column):
    """Process a pandas DataFrame with HyperLogLog."""
    hll = HyperLogLogEstimator(k=12)
    for user_id in df[id_column]:
        # Zero-pad to a fixed 32-bit binary string
        hash_val = '{:032b}'.format(int(hashlib.sha256(str(user_id).encode()).hexdigest()[:8], 16))
        hll.update(hash_val)
    return hll.estimate()

# Usage
df = pd.read_csv('user_data.csv')
unique_users = process_dataframe(df, 'user_id')
print(f"Estimated unique users: {unique_users}")

Real-time Stream Processing

from cardinalitykit import HyperLogLogEstimator
import hashlib

class StreamProcessor:
    def __init__(self, k=10):
        self.hll = HyperLogLogEstimator(k=k)
        self.processed_count = 0

    def process_event(self, user_id):
        """Process a single event from the stream."""
        hash_val = '{:032b}'.format(int(hashlib.sha256(str(user_id).encode()).hexdigest()[:8], 16))
        self.hll.update(hash_val)
        self.processed_count += 1

    def get_current_estimate(self):
        """Get the current cardinality estimate."""
        return {
            'unique_estimate': self.hll.estimate(),
            'total_processed': self.processed_count,
            'memory_usage': self.hll.memory_usage()
        }

# Usage
processor = StreamProcessor(k=12)

# Process events as they arrive
processor.process_event('user_123')
processor.process_event('user_456')

stats = processor.get_current_estimate()

Production Deployment Tips

  • Monitor memory usage in production environments
  • Implement proper error handling for malformed data
  • Consider sketch serialization for persistence
  • Use appropriate logging for debugging and monitoring
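One way to approach the serialization tip, assuming the sketch objects are plain picklable Python objects (the dict below is a stand-in for a real estimator's state, not package code):

```python
import os
import pickle
import tempfile

def save_sketch(sketch, path):
    # Persist any picklable sketch object to disk
    with open(path, 'wb') as f:
        pickle.dump(sketch, f)

def load_sketch(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in for a real estimator's internal state
state = {'k': 10, 'registers': [0] * 1024}
path = os.path.join(tempfile.gettempdir(), 'sketch.pkl')
save_sketch(state, path)
restored = load_sketch(path)
print(restored == state)  # prints True
```

For cross-process or cross-language persistence, a versioned format (e.g. JSON of the register array) is safer than pickle, which is Python-specific and should never be loaded from untrusted sources.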

The CardinalityKit package provides a complete, professional solution for cardinality estimation needs, from research and development to production deployment. Its clean API, comprehensive documentation, and extensive examples make it suitable for both academic research and industrial applications.