Package Overview
CardinalityKit is a Python package that organizes the major cardinality estimation algorithms into a clean, professional toolkit. It turns research implementations into production-ready code with object-oriented interfaces, comprehensive documentation, and easy-to-use APIs.
Key Features
- Complete Algorithm Suite: All major cardinality estimation algorithms
- Extended Functionality: Demographic tracking and attribute analysis
- Sample Conversion: Transform sample data to HyperReal sketches
- Professional API: Clean, consistent object-oriented interface
- Backward Compatibility: Maintains compatibility with original research code
Package Structure
The package is organized into logical modules for easy navigation and usage:
cardinalitykit/
├── algorithms/ # Core estimation algorithms
│ ├── flajolet_martin.py # Original FM algorithm (1985)
│ ├── loglog.py # LogLog algorithm
│ ├── superloglog.py # SuperLogLog with outlier elimination
│ ├── hyperloglog.py # Industry-standard HLL
│ └── hyperreal.py # Unbiased HyperReal algorithm
├── extended/ # Extended algorithms with demographics
│ ├── extended_hyperloglog.py # HLL with attribute tracking
│ └── extended_hyperreal.py # HyperReal with attribute tracking
├── conversion/ # Sample data conversion
│ └── sample_to_hyperreal.py # Sample to HyperReal conversion
├── utils/ # Utility functions
│ ├── csv_parser.py # CSV parsing and hash generation
│ └── actual_count.py # Exact counting for validation
├── examples.py # Usage examples
└── README.md # Comprehensive documentation
Installation & Setup
The package can be easily integrated into any Python project:
# Add to Python path
import sys
sys.path.append('/path/to/CardinalityKit')
# Import the package
import cardinalitykit
from cardinalitykit import HyperLogLogEstimator, ExtendedHyperLogLogSketch
Dependencies
- numpy ≥ 1.19.0 - Numerical computations
- pandas ≥ 1.1.0 - Data manipulation
- tqdm ≥ 4.50.0 - Progress bars
- scipy ≥ 1.5.0 - Scientific computing
Core Algorithms
All core algorithms follow a consistent interface pattern for ease of use:
| Algorithm | Memory | Accuracy | Use Case |
|-----------|--------|----------|----------|
| Flajolet-Martin | Low | Basic | Historical reference |
| LogLog | Medium | Good | General purpose |
| SuperLogLog | Medium | Better | Outlier elimination |
| HyperLogLog | Medium | Best | Industry standard |
| HyperReal | Medium | Unbiased | Research applications |
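The accuracy column can be made quantitative for the HyperLogLog family: with m = 2^k registers, the relative standard error is roughly 1.04/√m. The helper below is an illustration of that rule of thumb, not part of the package (and the 1.04 constant applies to HyperLogLog specifically; the other algorithms have different constants):

```python
import math

def hll_standard_error(k: int) -> float:
    """Approximate relative standard error of HyperLogLog with
    m = 2**k registers (the classic 1.04 / sqrt(m) bound)."""
    m = 2 ** k
    return 1.04 / math.sqrt(m)

for k in (8, 10, 12, 14):
    print(f"k={k:2d}  m={2**k:5d}  ~{hll_standard_error(k) * 100:.2f}% error")
```

This is why doubling the buckets (k + 1) only improves accuracy by a factor of √2.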
Basic Usage Pattern
from cardinalitykit import HyperLogLogEstimator
import hashlib
# Create estimator
hll = HyperLogLogEstimator(k=10)
# Process data
data = ["user_1", "user_2", "user_3", "user_1"]
for item in data:
    # Zero-pad to a fixed 32-bit width ('{:32b}' would pad with spaces)
    hash_val = '{:032b}'.format(int(hashlib.sha256(item.encode()).hexdigest()[:8], 16))
    hll.update(hash_val)
# Get results
estimate = hll.estimate()
memory_usage = hll.memory_usage()
print(f"Estimated cardinality: {estimate}")
print(f"Memory usage: {memory_usage} bytes")
Extended Algorithms
Extended algorithms add demographic tracking capabilities to standard cardinality estimation:
from cardinalitykit import ExtendedHyperLogLogSketch
# Create extended sketch
ehll = ExtendedHyperLogLogSketch(b_m=8, b_s=8)
# Process events with attributes
events = [
    {'id_to_count': 'user_1', 'attribute': 'group_A'},
    {'id_to_count': 'user_2', 'attribute': 'group_B'},
    {'id_to_count': 'user_3', 'attribute': 'group_A'},
    {'id_to_count': 'user_1', 'attribute': 'group_A'}  # Duplicate
]
for event in events:
    ehll.update_sketch(event)
# Get results
total_estimate = ehll.get_cardinality_estimate()
attribute_frequencies = ehll.get_frequency_for_attr()
print(f"Total cardinality: {total_estimate}")
print(f"Group A frequency: {ehll.get_frequency_for_attr('group_A')}")
print(f"All attributes: {attribute_frequencies}")
Key Features of Extended Algorithms:
- Track frequency counts per demographic attribute
- Maintain attribute samples for collision resolution
- Provide per-attribute cardinality estimates
- Support both HyperLogLog and HyperReal base algorithms
Sample Conversion
The conversion module enables transformation of sample data (e.g., survey panels) into HyperReal sketches:
from cardinalitykit import ExtendedHyperRealSketchFromSample
# Sample data: (id, weight, attribute)
sample_data = [
    ("sample_1", 0.4, "demographic_A"),
    ("sample_2", 0.3, "demographic_B"),
    ("sample_3", 0.3, "demographic_A")
]
# Create converter
converter = ExtendedHyperRealSketchFromSample(b_m=8, sample_data=sample_data)
# Run conversion (choose method)
converter.naive_associate(sum_weights=10000) # Simple method
# converter.fast_associate(sum_weights=10000, D=15) # Optimized method
# Get results
estimate = converter.get_cardinality_estimate()
attr_frequencies = converter.get_frequency_for_attr()
print(f"Converted cardinality: {estimate}")
for attr, freq in attr_frequencies.items():
    print(f"{attr}: {freq}")
Conversion Methods
- Naive Associate: Simple one-to-one virtual person assignment
- Fast Associate: Optimized method with depth parameter for better accuracy
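As a rough illustration of the idea behind naive association (a conceptual sketch, not the package's internals): each weighted sample row is expanded into a proportional number of distinct "virtual persons" carrying the row's attribute, which are then inserted into the sketch. The function name and id scheme below are hypothetical, and weights are assumed to be normalized fractions:

```python
def naive_virtual_expansion(sample_data, sum_weights):
    """Conceptual sketch: expand each weighted sample row into
    round(weight * sum_weights) virtual persons, each with a
    distinct id and the row's attribute."""
    virtual_persons = []
    for sample_id, weight, attribute in sample_data:
        count = round(weight * sum_weights)
        for i in range(count):
            # Distinct id per virtual person, derived from the sample id
            virtual_persons.append((f"{sample_id}#{i}", attribute))
    return virtual_persons

people = naive_virtual_expansion(
    [("sample_1", 0.4, "A"), ("sample_2", 0.6, "B")], sum_weights=10
)
```

The fast method avoids materializing every virtual person, which is why it scales better for large virtual populations.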
Utility Functions
The package includes comprehensive utilities for data processing and validation:
CSV Processing
from cardinalitykit.utils import CSVParser, parse_csv
# Create parser
parser = CSVParser(row_limit=1000) # Optional row limit
# Parse specific columns
data = parse_csv('data.csv', columns=[0, 1, 3])
# Generate hashes from column
hashes = parser.create_hashes_from_column('data.csv', column_index=1)
# Count lines
total_rows = parser.count_lines('data.csv')
Exact Counting for Validation
from cardinalitykit.utils import get_exact_unique_using_set
# Get exact count for comparison
exact_count, memory_used = get_exact_unique_using_set('data.csv', column_index=1)
print(f"Exact unique count: {exact_count}")
print(f"Memory used: {memory_used} bytes")
Complete Example
Here's a comprehensive example demonstrating algorithm comparison:
from cardinalitykit import *
import hashlib
def algorithm_comparison():
    # Create sample data
    data = [f"user_{i}" for i in range(5000)]

    # Initialize algorithms
    algorithms = {
        'Flajolet-Martin': FlajoletMartinEstimator(),
        'LogLog': LogLogEstimator(k=8),
        'SuperLogLog': SuperLogLogEstimator(k=8),
        'HyperLogLog': HyperLogLogEstimator(k=8),
        'HyperReal': HyperRealEstimator(k=8)
    }

    # Process data
    for name, alg in algorithms.items():
        for item in data:
            # Zero-padded 32-bit binary hash string
            hash_val = '{:032b}'.format(int(hashlib.sha256(item.encode()).hexdigest()[:8], 16))
            alg.update(hash_val)

    # Compare results
    actual = len(set(data))
    print(f"Actual unique count: {actual}")
    print("-" * 50)
    for name, alg in algorithms.items():
        estimate = alg.estimate()
        error = abs(estimate - actual) / actual * 100
        memory = alg.memory_usage()
        print(f"{name:15}: {estimate:8.0f} ({error:5.2f}% error, {memory:4d} bytes)")

# Run comparison
algorithm_comparison()
API Reference
Core Algorithm Interface
- __init__(k) - Initialize with k bits for bucket indexing
- update(hash_value) - Process a hash value
- estimate() - Get cardinality estimate
- memory_usage() - Get memory usage in bytes
Extended Algorithm Interface
- __init__(b_m, b_s) - Initialize with bucket and indicator bits
- update_sketch(event) - Process event with id and attribute
- get_cardinality_estimate() - Get total cardinality
- get_frequency_for_attr(attr) - Get frequency for specific attribute
- get_sketch() - Get internal data structures
Conversion Algorithm Interface
- __init__(b_m, sample_data) - Initialize with sample data
- naive_associate(sum_weights) - Simple conversion method
- fast_associate(sum_weights, D) - Optimized conversion method
- get_cardinality_estimate() - Get converted cardinality
- get_frequency_for_attr(attr) - Get attribute frequencies
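The core interface above can be captured as an abstract base class. The sketch below is illustrative only (CardinalityKit may not ship such a class); the trivial ExactCounter shows the contract using exact set-based counting:

```python
import sys
from abc import ABC, abstractmethod

class CardinalityEstimator(ABC):
    """Hypothetical ABC mirroring the core algorithm interface;
    not necessarily part of the package itself."""

    @abstractmethod
    def update(self, hash_value: str) -> None:
        """Process one fixed-width binary hash string."""

    @abstractmethod
    def estimate(self) -> float:
        """Return the current cardinality estimate."""

    @abstractmethod
    def memory_usage(self) -> int:
        """Return memory usage in bytes."""

class ExactCounter(CardinalityEstimator):
    """Reference implementation: exact counting with a set."""
    def __init__(self):
        self.seen = set()

    def update(self, hash_value):
        self.seen.add(hash_value)

    def estimate(self):
        return float(len(self.seen))

    def memory_usage(self):
        return sys.getsizeof(self.seen)
```

Any object satisfying this contract can be dropped into the comparison loop from the complete example.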
Best Practices
Algorithm Selection Guidelines
- HyperLogLog: Default choice for most applications
- HyperReal: When unbiased estimates are critical
- Extended versions: When demographic analysis is needed
- Sample conversion: For privacy-preserving analytics
Parameter Selection
- k parameter: Start with k=10 (1024 buckets) for most applications
- Higher k: Better accuracy but more memory usage
- Lower k: Less memory but reduced accuracy
- Extended algorithms: Use b_m = b_s for balanced performance
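Assuming the standard HyperLogLog error model (relative error ≈ 1.04/√(2^k)), the guidance above can be inverted to pick k from a target error. This helper is an illustration, not part of the package API:

```python
import math

def k_for_target_error(target_rel_error: float) -> int:
    """Smallest k such that 1.04 / sqrt(2**k) <= target_rel_error,
    using the standard HyperLogLog error model (an assumption,
    not a guarantee for every algorithm in the kit)."""
    m_needed = (1.04 / target_rel_error) ** 2
    return math.ceil(math.log2(m_needed))

k_for_target_error(0.01)  # k needed for ~1% relative error
```

For a 1% target this yields k=14 (16384 buckets), which matches the "higher k, better accuracy, more memory" trade-off above.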
Performance Considerations
- Hash quality is critical - use strong hash functions
- Extended algorithms have higher memory overhead
- Sample conversion methods scale with virtual population size
- Batch processing improves performance for large datasets
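Since hash quality is critical, the hash expression repeated throughout the examples is worth factoring into a single helper. Note the zero-padding: a format code like `'{:32b}'` pads with spaces, while the estimators expect a fixed-width string of 0/1 characters. The helper name below is a suggestion, not a package function:

```python
import hashlib

def hash_to_bits(item, width: int = 32) -> str:
    """Map an item to a fixed-width, zero-padded binary string.
    Zero-padding matters: the estimators inspect leading zero bits,
    so every hash must be exactly `width` characters long."""
    digest = hashlib.sha256(str(item).encode()).hexdigest()
    value = int(digest[:8], 16)  # first 32 bits of the digest
    return format(value, f'0{width}b')

bits = hash_to_bits("user_1")
```

Using one helper everywhere also makes it easy to swap in a different hash function later.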
Integration Examples
Data Pipeline Integration
import pandas as pd
from cardinalitykit import HyperLogLogEstimator
import hashlib
def process_dataframe(df, id_column):
    """Process a pandas DataFrame with HyperLogLog."""
    hll = HyperLogLogEstimator(k=12)
    for user_id in df[id_column]:
        # Zero-padded 32-bit binary hash string
        hash_val = '{:032b}'.format(int(hashlib.sha256(str(user_id).encode()).hexdigest()[:8], 16))
        hll.update(hash_val)
    return hll.estimate()
# Usage
df = pd.read_csv('user_data.csv')
unique_users = process_dataframe(df, 'user_id')
print(f"Estimated unique users: {unique_users}")
Real-time Stream Processing
from cardinalitykit import HyperLogLogEstimator
import hashlib
class StreamProcessor:
    def __init__(self, k=10):
        self.hll = HyperLogLogEstimator(k=k)
        self.processed_count = 0

    def process_event(self, user_id):
        """Process a single event from the stream."""
        # Zero-padded 32-bit binary hash string
        hash_val = '{:032b}'.format(int(hashlib.sha256(str(user_id).encode()).hexdigest()[:8], 16))
        self.hll.update(hash_val)
        self.processed_count += 1

    def get_current_estimate(self):
        """Get the current cardinality estimate."""
        return {
            'unique_estimate': self.hll.estimate(),
            'total_processed': self.processed_count,
            'memory_usage': self.hll.memory_usage()
        }
# Usage
processor = StreamProcessor(k=12)
# Process events as they arrive
processor.process_event('user_123')
processor.process_event('user_456')
stats = processor.get_current_estimate()
Production Deployment Tips
- Monitor memory usage in production environments
- Implement proper error handling for malformed data
- Consider sketch serialization for persistence
- Use appropriate logging for debugging and monitoring
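For sketch persistence, plain pickle is often sufficient, assuming the sketch objects are ordinary picklable Python objects (verify this before relying on it; the package may provide its own serialization instead):

```python
import pickle

def save_sketch(sketch, path: str) -> None:
    """Persist a sketch to disk. Assumes the sketch is a plain
    picklable Python object; adjust if the package exposes its
    own serialization API."""
    with open(path, 'wb') as f:
        pickle.dump(sketch, f)

def load_sketch(path: str):
    """Restore a previously saved sketch from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```

Serialized sketches let a long-running stream processor survive restarts without losing its accumulated state.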
The CardinalityKit package provides a complete, professional solution for cardinality estimation needs, from research and development to production deployment. Its clean API, comprehensive documentation, and extensive examples make it suitable for both academic research and industrial applications.