🔬 Fundamentals of Methodology


Understanding the Mathematical Foundations of Cardinality Estimation

The Foundation: Hashing Properties

The foundation of cardinality estimation methodology lies in hashing. A good hash function maps input data uniformly across a large binary space, which is what makes probabilistic counting techniques possible.

🔍 The "Secret Zoom" Analogy

Think of hashing like using binoculars to discover stars in a seemingly starless part of the sky. Similarly, hashing takes strings without intrinsic meaning and expands them uniformly, allowing us to count distinct elements probabilistically.

Two Critical Hashing Properties

1. Uniform Expansion

The hash output should be uniformly distributed across the entire hash space to minimize collisions and ensure accurate probabilistic estimates.

2. Significance of Bits

In conventional binary notation the most significant bit sits on the left, but zero counting reads the hash from the right: thanks to uniformity, each successive bit position splits the hash space into two equal halves.
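This halving property can be checked directly on a tiny hash space. The sketch below (illustrative Python, using all 4-bit values as a stand-in for real hash outputs) counts how many values have at least r trailing zeros; uniformity implies the count halves with each additional required zero:

```python
def trailing_zeros(value, width=4):
    # Number of consecutive 0 bits at the right end of the value
    if value == 0:
        return width
    bits = format(value, f'0{width}b')
    return len(bits) - len(bits.rstrip('0'))

# Enumerate every value in a 4-bit "hash space"
for r in range(4):
    count = sum(1 for v in range(16) if trailing_zeros(v) >= r)
    print(r, count)  # counts halve: 16, 8, 4, 2
```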

Hashing Implementation: From Simple to Cryptographic

Simple Hashing Example

Encoding "HELLO":

Step 1: ASCII Conversion

  • 'H' → 72 → 01001000
  • 'E' → 69 → 01000101
  • 'L' → 76 → 01001100
  • 'L' → 76 → 01001100
  • 'O' → 79 → 01001111

Result: 01001000 01000101 01001100 01001100 01001111
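The conversion above can be reproduced in a line of Python (an illustrative snippet, not part of any library):

```python
# Convert each character of "HELLO" to its 8-bit ASCII representation
encoded = ' '.join(format(ord(ch), '08b') for ch in "HELLO")
print(encoded)  # 01001000 01000101 01001100 01001100 01001111
```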

```python
def simple_hash(text):
    key = 75  # Constant key
    hashed_text = bytearray()
    for byte in bytearray(text, 'ascii'):
        hashed_byte = byte & 0xFF
        hashed_byte = hashed_byte >> 1   # Right shift
        hashed_byte = hashed_byte ^ key  # XOR with key
        hashed_text.append(hashed_byte)
    return bytes(hashed_text)
```

Problems with Simple Hashing:

  • The right shift discards the low bit, so every output byte has exactly two colliding inputs (a 2-to-1 mapping)
  • Static key vulnerability
  • Small changes don't produce significantly different results
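The collision problem is easy to demonstrate concretely: because the right shift discards the low bit of every byte, two inputs differing only in that bit hash identically. A minimal check, restating the simple_hash sketch so the example is self-contained:

```python
def simple_hash(text):
    key = 75  # constant key
    hashed_text = bytearray()
    for byte in bytearray(text, 'ascii'):
        # The shift discards the lowest bit before the XOR
        hashed_text.append((byte >> 1) ^ key)
    return bytes(hashed_text)

# 'H' (72) and 'I' (73) differ only in the lowest bit, so they collide
print(simple_hash("H") == simple_hash("I"))  # True
```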

Cryptographic Hashing Properties

Deterministic

Same input always produces same hash value

Unique

Different inputs should, with overwhelming probability, produce different hash values (true uniqueness is impossible with a fixed-size output)

Fast

Computationally efficient for large datasets

Uniform Distribution

Hash output uniformly distributed across hash space

Collision Resistance

Difficult to find two inputs with same hash

Avalanche Effect

Small input changes cause large hash differences
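The avalanche effect can be observed directly by hashing two inputs that differ in a single character and comparing their bits; roughly half should flip. A quick check using Python's standard hashlib:

```python
import hashlib

def sha256_bits(text):
    # SHA-256 digest as a 256-character binary string
    digest = hashlib.sha256(text.encode('utf-8')).hexdigest()
    return format(int(digest, 16), '0256b')

a = sha256_bits("HELLO")
b = sha256_bits("HELLP")  # one character changed
flipped = sum(x != y for x, y in zip(a, b))
print(flipped / 256)  # typically close to 0.5
```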

Double Hashing for Enhanced Entropy

CardinalityKit uses double hashing to increase entropy and improve uniformity:

```python
import hashlib

def _hash_function(self, data):
    # Double hashing for increased entropy
    inner_hash = hashlib.sha256(str(data).encode('utf-8')).hexdigest()
    outer_hash = hashlib.sha256(inner_hash.encode('utf-8')).hexdigest()
    # Convert to 256-bit binary with zero padding
    return '{:0256b}'.format(int(outer_hash, 16))
```

Binary Conversion Mathematics

Binary Conversion Formula:

Quotient₁ = ⌊N/2⌋, Remainder₁ = N mod 2

Quotient₂ = ⌊Quotient₁/2⌋, Remainder₂ = Quotient₁ mod 2

...

Binary = (Remainderₖ, Remainderₖ₋₁, ..., Remainder₂, Remainder₁)₂

Example: Converting 72 to Binary

  1. 72 ÷ 2 = 36, remainder 0
  2. 36 ÷ 2 = 18, remainder 0
  3. 18 ÷ 2 = 9, remainder 0
  4. 9 ÷ 2 = 4, remainder 1
  5. 4 ÷ 2 = 2, remainder 0
  6. 2 ÷ 2 = 1, remainder 0
  7. 1 ÷ 2 = 0, remainder 1

Result: 01001000 (reading remainders bottom to top)
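The repeated-division procedure translates directly into code (an illustrative sketch, independent of any library):

```python
def to_binary(n, width=8):
    # Repeated division by 2; remainders read bottom-to-top give the bits
    remainders = []
    while n > 0:
        remainders.append(n % 2)
        n //= 2
    bits = ''.join(str(r) for r in reversed(remainders))
    return bits.zfill(width)  # zero-pad to a fixed width

print(to_binary(72))  # 01001000
```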

Why Zero Counting Works

🎲 Coin Flip Analogy

HyperLogLog's zero counting is based on calculating how many coin flips are needed for a specific pattern. For three consecutive 'heads':

  • Probability of 'heads' = 0.5
  • Probability of 'HHH' = 0.5 × 0.5 × 0.5 = 0.125
  • Expected number of three-flip trials needed = 1/0.125 = 8

This geometric distribution principle underlies cardinality estimation through zero counting.

Geometric Distribution Mean:

E[X] = 1/p

where p is the probability of success
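The E[X] = 1/p relationship can be verified by simulation: treat each group of three flips as one trial and count trials until 'HHH' appears. A sketch using Python's random module (the seed and trial count are arbitrary choices):

```python
import random

def trials_until_hhh(rng):
    # Each trial is three independent fair coin flips; succeed on HHH
    trials = 1
    while not all(rng.random() < 0.5 for _ in range(3)):
        trials += 1
    return trials

rng = random.Random(42)  # fixed seed for reproducibility
mean = sum(trials_until_hhh(rng) for _ in range(100_000)) / 100_000
print(mean)  # close to the theoretical 1/0.125 = 8
```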

From Hashing to Sketches

The uniform hashing properties enable us to:

  1. Bucket Assignment: Use first k bits to assign elements to 2^k buckets
  2. Rank Calculation: Find the position of the first 1 bit from the right (trailing zeros + 1) in the remaining bits
  3. Cardinality Estimation: Apply mathematical formulas based on bucket statistics
  4. Bias Correction: Adjust estimates using algorithm-specific factors

```python
# Example: extracting bucket and rank from a hash
def process_hash(hash_binary, k):
    # First k bits determine the bucket
    bucket = int(hash_binary[:k], 2)
    # Remaining bits are used for rank calculation
    remaining_bits = hash_binary[k:]
    # Rank = trailing zeros + 1 (position of the first 1 bit from the right)
    rank = 1
    for bit in reversed(remaining_bits):
        if bit == "0":
            rank += 1
        else:
            break
    return bucket, rank
```
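Putting the pieces together, the whole pipeline — double SHA-256 hashing, 256-bit binary conversion, then bucket and rank extraction — can be sketched end to end (illustrative; the function names here are mine, not CardinalityKit's API):

```python
import hashlib

def hash_to_bits(data):
    # Double hashing for extra entropy, then a 256-bit binary string
    inner = hashlib.sha256(str(data).encode('utf-8')).hexdigest()
    outer = hashlib.sha256(inner.encode('utf-8')).hexdigest()
    return '{:0256b}'.format(int(outer, 16))

def process_hash(hash_binary, k):
    bucket = int(hash_binary[:k], 2)  # first k bits -> bucket
    remaining = hash_binary[k:]
    rank = 1                          # position of first 1 bit from the right
    for bit in reversed(remaining):
        if bit == '0':
            rank += 1
        else:
            break
    return bucket, rank

bucket, rank = process_hash(hash_to_bits("HELLO"), k=4)
print(bucket, rank)  # bucket in 0..15, rank >= 1
```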

Key Takeaways

  • Uniform Distribution: Critical for accurate probabilistic estimates
  • Double Hashing: Increases entropy and reduces bias
  • Bit Significance: Under uniformity, each bit position splits the hash space in half; zero counting reads the low-order (rightmost) bits
  • Zero Counting: Based on geometric distribution principles
  • Scalability: Memory usage scales as log(log(N)) vs linear for exact counting
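The scalability claim can be made concrete with back-of-the-envelope arithmetic (illustrative figures, not measurements): each HLL register only needs enough bits to store a rank, roughly log₂(log₂(N)) plus a small constant, while exact counting must store every distinct element:

```python
N = 10**9                # one billion distinct 64-bit identifiers
exact_bytes = N * 8      # exact counting: store every element (8 bytes each)
m = 2**14                # 16,384 HLL buckets (k = 14), an illustrative choice
bits_per_register = 6    # enough to hold ranks up to 64 for 64-bit hashes
hll_bytes = m * bits_per_register / 8

print(exact_bytes)  # 8000000000 bytes (~8 GB)
print(hll_bytes)    # 12288.0 bytes (~12 KB)
```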