Package Overview

CardinalityKit is a Python package that organizes a full suite of cardinality estimation algorithms into a clean, professional toolkit. It transforms research implementations into production-ready code with object-oriented interfaces, thorough documentation, and easy-to-use APIs.

Key Features

  • Complete Algorithm Suite: All major cardinality estimation algorithms
  • Extended Functionality: Demographic tracking and attribute analysis
  • Sample Conversion: Transform sample data to HyperReal sketches
  • Professional API: Clean, consistent object-oriented interface
  • Backward Compatibility: Maintains compatibility with original research code

Package Structure

The package is organized into logical modules for easy navigation and usage:

cardinalitykit/
├── algorithms/                    # Core estimation algorithms
│   ├── flajolet_martin.py         # Original FM algorithm (1985)
│   ├── loglog.py                  # LogLog algorithm
│   ├── superloglog.py             # SuperLogLog with outlier elimination
│   ├── hyperloglog.py             # Industry-standard HLL
│   └── hyperreal.py               # Unbiased HyperReal algorithm
├── extended/                      # Extended algorithms with demographics
│   ├── extended_hyperloglog.py    # HLL with attribute tracking
│   └── extended_hyperreal.py      # HyperReal with attribute tracking
├── conversion/                    # Sample data conversion
│   └── sample_to_hyperreal.py     # Sample to HyperReal conversion
├── utils/                         # Utility functions
│   ├── csv_parser.py              # CSV parsing and hash generation
│   └── actual_count.py            # Exact counting for validation
├── examples.py                    # Usage examples
└── README.md                      # Comprehensive documentation

Installation & Setup

The package can be easily integrated into any Python project:

# Add to Python path
import sys
sys.path.append('/path/to/CardinalityKit')

# Import the package
import cardinalitykit
from cardinalitykit import HyperLogLogEstimator, ExtendedHyperLogLogSketch

Dependencies

  • numpy ≥ 1.19.0 - Numerical computations
  • pandas ≥ 1.1.0 - Data manipulation
  • tqdm ≥ 4.50.0 - Progress bars
  • scipy ≥ 1.5.0 - Scientific computing

Core Algorithms

All core algorithms follow a consistent interface pattern for ease of use:

Algorithm         Memory   Accuracy   Use Case
Flajolet-Martin   Low      Basic      Historical reference
LogLog            Medium   Good       General purpose
SuperLogLog       Medium   Better     Outlier elimination
HyperLogLog       Medium   Best       Industry standard
HyperReal         Medium   Unbiased   Research applications

Basic Usage Pattern

from cardinalitykit import HyperLogLogEstimator
import hashlib

# Create estimator
hll = HyperLogLogEstimator(k=10)

# Process data
data = ["user_1", "user_2", "user_3", "user_1"]
for item in data:
    # Zero-pad to a fixed 32-bit binary string so bucket bits stay aligned
    hash_val = '{:032b}'.format(int(hashlib.sha256(item.encode()).hexdigest()[:8], 16))
    hll.update(hash_val)

# Get results
estimate = hll.estimate()
memory_usage = hll.memory_usage()
print(f"Estimated cardinality: {estimate}")
print(f"Memory usage: {memory_usage} bytes")

Extended Algorithms

Extended algorithms add demographic tracking capabilities to standard cardinality estimation:

from cardinalitykit import ExtendedHyperLogLogSketch

# Create extended sketch
ehll = ExtendedHyperLogLogSketch(b_m=8, b_s=8)

# Process events with attributes
events = [
    {'id_to_count': 'user_1', 'attribute': 'group_A'},
    {'id_to_count': 'user_2', 'attribute': 'group_B'},
    {'id_to_count': 'user_3', 'attribute': 'group_A'},
    {'id_to_count': 'user_1', 'attribute': 'group_A'}  # Duplicate
]
for event in events:
    ehll.update_sketch(event)

# Get results
total_estimate = ehll.get_cardinality_estimate()
attribute_frequencies = ehll.get_frequency_for_attr()
print(f"Total cardinality: {total_estimate}")
print(f"Group A frequency: {ehll.get_frequency_for_attr('group_A')}")
print(f"All attributes: {attribute_frequencies}")
Key Features of Extended Algorithms:
  • Track frequency counts per demographic attribute
  • Maintain attribute samples for collision resolution
  • Provide per-attribute cardinality estimates
  • Support both HyperLogLog and HyperReal base algorithms

Sample Conversion

The conversion module enables transformation of sample data (e.g., survey panels) into HyperReal sketches:

from cardinalitykit import ExtendedHyperRealSketchFromSample

# Sample data: (id, weight, attribute)
sample_data = [
    ("sample_1", 0.4, "demographic_A"),
    ("sample_2", 0.3, "demographic_B"),
    ("sample_3", 0.3, "demographic_A")
]

# Create converter
converter = ExtendedHyperRealSketchFromSample(b_m=8, sample_data=sample_data)

# Run conversion (choose one method)
converter.naive_associate(sum_weights=10000)          # Simple method
# converter.fast_associate(sum_weights=10000, D=15)   # Optimized method

# Get results
estimate = converter.get_cardinality_estimate()
attr_frequencies = converter.get_frequency_for_attr()
print(f"Converted cardinality: {estimate}")
for attr, freq in attr_frequencies.items():
    print(f"{attr}: {freq}")

Conversion Methods

  • Naive Associate: Simple one-to-one virtual person assignment
  • Fast Associate: Optimized method with depth parameter for better accuracy

Utility Functions

The package includes comprehensive utilities for data processing and validation:

CSV Processing

from cardinalitykit.utils import CSVParser, parse_csv

# Create parser
parser = CSVParser(row_limit=1000)  # Optional row limit

# Parse specific columns
data = parse_csv('data.csv', columns=[0, 1, 3])

# Generate hashes from a column
hashes = parser.create_hashes_from_column('data.csv', column_index=1)

# Count lines
total_rows = parser.count_lines('data.csv')

Exact Counting for Validation

from cardinalitykit.utils import get_exact_unique_using_set

# Get exact count for comparison
exact_count, memory_used = get_exact_unique_using_set('data.csv', column_index=1)
print(f"Exact unique count: {exact_count}")
print(f"Memory used: {memory_used} bytes")

Complete Example

Here's an end-to-end example comparing all five core algorithms against an exact count:

from cardinalitykit import *
import hashlib

def algorithm_comparison():
    # Create sample data
    data = [f"user_{i}" for i in range(5000)]

    # Initialize algorithms
    algorithms = {
        'Flajolet-Martin': FlajoletMartinEstimator(),
        'LogLog': LogLogEstimator(k=8),
        'SuperLogLog': SuperLogLogEstimator(k=8),
        'HyperLogLog': HyperLogLogEstimator(k=8),
        'HyperReal': HyperRealEstimator(k=8)
    }

    # Process data
    for name, alg in algorithms.items():
        for item in data:
            # Zero-pad to a fixed 32-bit binary string
            hash_val = '{:032b}'.format(int(hashlib.sha256(item.encode()).hexdigest()[:8], 16))
            alg.update(hash_val)

    # Compare results
    actual = len(set(data))
    print(f"Actual unique count: {actual}")
    print("-" * 50)
    for name, alg in algorithms.items():
        estimate = alg.estimate()
        error = abs(estimate - actual) / actual * 100
        memory = alg.memory_usage()
        print(f"{name:15}: {estimate:8.0f} ({error:5.2f}% error, {memory:4d} bytes)")

# Run comparison
algorithm_comparison()

API Reference

Core Algorithm Interface

  • __init__(k) - Initialize with k bits for bucket indexing
  • update(hash_value) - Process a hash value
  • estimate() - Get cardinality estimate
  • memory_usage() - Get memory usage in bytes
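To illustrate this contract, here is a hypothetical toy estimator (not part of the package; all names and constants are illustrative) that implements the four methods with a single register tracking the largest first-1-bit position, in the spirit of the classic probabilistic counting sketches:

```python
class ToyEstimator:
    """Minimal sketch conforming to the core interface (illustrative only)."""

    def __init__(self, k=8):
        self.k = k          # kept for interface parity; unused by this toy
        self.max_rho = 0    # largest observed position of the first 1-bit

    def update(self, hash_value):
        # hash_value is a fixed-width binary string, e.g. '0010110010'
        pos = hash_value.find('1')
        rho = (pos + 1) if pos != -1 else len(hash_value) + 1
        self.max_rho = max(self.max_rho, rho)

    def estimate(self):
        # Crude single-register estimate 2^R; real algorithms average
        # many registers and apply bias correction
        return float(2 ** self.max_rho) if self.max_rho else 0.0

    def memory_usage(self):
        return 4  # one small integer register (assumed 4 bytes)

est = ToyEstimator(k=8)
for h in ('0010110010', '0001011101', '0110001001'):
    est.update(h)
print(est.estimate())  # prints 16.0 (first-1 positions are 3, 4, 2)
```

A production estimator replaces the single register with 2^k registers selected by the first k hash bits, which is what drives the accuracy figures in the table above.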

Extended Algorithm Interface

  • __init__(b_m, b_s) - Initialize with bucket and indicator bits
  • update_sketch(event) - Process event with id and attribute
  • get_cardinality_estimate() - Get total cardinality
  • get_frequency_for_attr(attr) - Get frequency for specific attribute
  • get_sketch() - Get internal data structures

Conversion Algorithm Interface

  • __init__(b_m, sample_data) - Initialize with sample data
  • naive_associate(sum_weights) - Simple conversion method
  • fast_associate(sum_weights, D) - Optimized conversion method
  • get_cardinality_estimate() - Get converted cardinality
  • get_frequency_for_attr(attr) - Get attribute frequencies

Best Practices

Algorithm Selection Guidelines

  • HyperLogLog: Default choice for most applications
  • HyperReal: When unbiased estimates are critical
  • Extended versions: When demographic analysis is needed
  • Sample conversion: For privacy-preserving analytics

Parameter Selection

  • k parameter: Start with k=10 (1024 buckets) for most applications
  • Higher k: Better accuracy but more memory usage
  • Lower k: Less memory but reduced accuracy
  • Extended algorithms: Use b_m = b_s for balanced performance
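As a back-of-the-envelope guide for choosing k, HyperLogLog's relative standard error is approximately 1.04/√m with m = 2^k registers; this quick sketch (independent of the package) previews the accuracy/memory trade-off:

```python
import math

# Expected HLL relative standard error for several register counts
for k in (8, 10, 12, 14):
    m = 2 ** k                      # number of registers
    rel_err = 1.04 / math.sqrt(m)   # classic HLL standard error
    print(f"k={k:2d}  registers={m:5d}  ~{rel_err * 100:.2f}% relative error")
```

For example, the recommended k=10 gives roughly 3.25% standard error, while k=14 brings it under 1% at sixteen times the register count.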

Performance Considerations

  • Hash quality is critical - use strong hash functions
  • Extended algorithms have higher memory overhead
  • Sample conversion methods scale with virtual population size
  • Batch processing improves performance for large datasets
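One way to act on the batch-processing point is to precompute hashes for a whole batch before the update loop. This sketch assumes, as in the usage examples above, that estimators consume fixed-width 32-bit binary hash strings one at a time (the helper name is illustrative):

```python
import hashlib

def hash_batch(items):
    """Precompute fixed-width binary hash strings for a batch of items."""
    return [
        '{:032b}'.format(int(hashlib.sha256(str(i).encode()).hexdigest()[:8], 16))
        for i in items
    ]

hashes = hash_batch(['user_1', 'user_2', 'user_3'])
# Each entry is a 32-character binary string ready for estimator.update()
print(len(hashes), len(hashes[0]))
```

Separating hashing from sketch updates also makes it easy to parallelize the hashing stage, which usually dominates the cost for large datasets.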

Integration Examples

Data Pipeline Integration

import pandas as pd
from cardinalitykit import HyperLogLogEstimator
import hashlib

def process_dataframe(df, id_column):
    """Process a pandas DataFrame with HyperLogLog."""
    hll = HyperLogLogEstimator(k=12)
    for user_id in df[id_column]:
        # Zero-pad to a fixed 32-bit binary string
        hash_val = '{:032b}'.format(int(hashlib.sha256(str(user_id).encode()).hexdigest()[:8], 16))
        hll.update(hash_val)
    return hll.estimate()

# Usage
df = pd.read_csv('user_data.csv')
unique_users = process_dataframe(df, 'user_id')
print(f"Estimated unique users: {unique_users}")

Real-time Stream Processing

from cardinalitykit import HyperLogLogEstimator
import hashlib

class StreamProcessor:
    def __init__(self, k=10):
        self.hll = HyperLogLogEstimator(k=k)
        self.processed_count = 0

    def process_event(self, user_id):
        """Process a single event from the stream."""
        hash_val = '{:032b}'.format(int(hashlib.sha256(str(user_id).encode()).hexdigest()[:8], 16))
        self.hll.update(hash_val)
        self.processed_count += 1

    def get_current_estimate(self):
        """Get the current cardinality estimate."""
        return {
            'unique_estimate': self.hll.estimate(),
            'total_processed': self.processed_count,
            'memory_usage': self.hll.memory_usage()
        }

# Usage
processor = StreamProcessor(k=12)

# Process events as they arrive
processor.process_event('user_123')
processor.process_event('user_456')

stats = processor.get_current_estimate()

Production Deployment Tips

  • Monitor memory usage in production environments
  • Implement proper error handling for malformed data
  • Consider sketch serialization for persistence
  • Use appropriate logging for debugging and monitoring
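One way to approach the serialization tip, assuming the sketch objects are plain picklable Python objects (the dict below is a stand-in for a real estimator's state, not package code):

```python
import os
import pickle
import tempfile

def save_sketch(sketch, path):
    # Persist any picklable sketch object to disk
    with open(path, 'wb') as f:
        pickle.dump(sketch, f)

def load_sketch(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in for a real estimator's internal state
state = {'k': 10, 'registers': [0] * 1024}
path = os.path.join(tempfile.gettempdir(), 'sketch.pkl')
save_sketch(state, path)
restored = load_sketch(path)
print(restored == state)  # prints True
```

For cross-process or cross-language persistence, a versioned format (e.g. JSON of the register array) is safer than pickle, which is Python-specific and should never be loaded from untrusted sources.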

The CardinalityKit package provides a complete, professional solution for cardinality estimation needs, from research and development to production deployment. Its clean API, comprehensive documentation, and extensive examples make it suitable for both academic research and industrial applications.