🔬 Fundamentals of Methodology


Understanding the Mathematical Foundations of Cardinality Estimation

The Foundation: Hashing Properties

The foundation of cardinality estimation methodology lies in hashing. A good hash function maps input data uniformly across a large binary space, which is what makes probabilistic counting techniques possible.

🔍 The "Secret Zoom" Analogy

Think of hashing like using binoculars to discover stars in a seemingly starless part of the sky. Similarly, hashing takes strings without intrinsic meaning and expands them uniformly, allowing us to count distinct elements probabilistically.

Two Critical Hashing Properties

1. Uniform Expansion

The hash output should be uniformly distributed across the entire hash space to minimize collisions and ensure accurate probabilistic estimates.

2. Significance of Bits

In conventional binary notation the most significant bit sits on the left, but zero counting reads the hash from the right: thanks to uniformity, each successive bit position splits the hash space into two equal halves.
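This halving property can be checked directly on a tiny hash space. The sketch below (illustrative Python, using all 4-bit values as a stand-in for real hash outputs) counts how many values have at least r trailing zeros; uniformity implies the count halves with each additional required zero:

```python
def trailing_zeros(value, width=4):
    # Number of consecutive 0 bits at the right end of the value
    if value == 0:
        return width
    bits = format(value, f'0{width}b')
    return len(bits) - len(bits.rstrip('0'))

# Enumerate every value in a 4-bit "hash space"
for r in range(4):
    count = sum(1 for v in range(16) if trailing_zeros(v) >= r)
    print(r, count)  # counts halve: 16, 8, 4, 2
```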

Hashing Implementation: From Simple to Cryptographic

Simple Hashing Example

Encoding "HELLO":

Step 1: ASCII Conversion

  • 'H' → 72 → 01001000
  • 'E' → 69 → 01000101
  • 'L' → 76 → 01001100
  • 'L' → 76 → 01001100
  • 'O' → 79 → 01001111

Result: 01001000 01000101 01001100 01001100 01001111
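The conversion above can be reproduced in a line of Python (an illustrative snippet, not part of any library):

```python
# Convert each character of "HELLO" to its 8-bit ASCII representation
encoded = ' '.join(format(ord(ch), '08b') for ch in "HELLO")
print(encoded)  # 01001000 01000101 01001100 01001100 01001111
```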

```python
def simple_hash(text):
    key = 75  # Constant key
    hashed_text = bytearray()
    for byte in bytearray(text, 'ascii'):
        hashed_byte = byte & 0xFF
        hashed_byte = hashed_byte >> 1   # Right shift
        hashed_byte = hashed_byte ^ key  # XOR with key
        hashed_text.append(hashed_byte)
    return bytes(hashed_text)
```

Problems with Simple Hashing:

  • The right shift discards the low bit, so every output byte has exactly two colliding inputs (a 2-to-1 mapping)
  • Static key vulnerability
  • Small changes don't produce significantly different results
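The collision problem is easy to demonstrate concretely: because the right shift discards the low bit of every byte, two inputs differing only in that bit hash identically. A minimal check, restating the simple_hash sketch so the example is self-contained:

```python
def simple_hash(text):
    key = 75  # constant key
    hashed_text = bytearray()
    for byte in bytearray(text, 'ascii'):
        # The shift discards the lowest bit before the XOR
        hashed_text.append((byte >> 1) ^ key)
    return bytes(hashed_text)

# 'H' (72) and 'I' (73) differ only in the lowest bit, so they collide
print(simple_hash("H") == simple_hash("I"))  # True
```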

Cryptographic Hashing Properties

Deterministic

Same input always produces same hash value

Unique

Different inputs should, with overwhelming probability, produce different hash values (true uniqueness is impossible with a fixed-size output)

Fast

Computationally efficient for large datasets

Uniform Distribution

Hash output uniformly distributed across hash space

Collision Resistance

Difficult to find two inputs with same hash

Avalanche Effect

Small input changes cause large hash differences
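The avalanche effect can be observed directly by hashing two inputs that differ in a single character and comparing their bits; roughly half should flip. A quick check using Python's standard hashlib:

```python
import hashlib

def sha256_bits(text):
    # SHA-256 digest as a 256-character binary string
    digest = hashlib.sha256(text.encode('utf-8')).hexdigest()
    return format(int(digest, 16), '0256b')

a = sha256_bits("HELLO")
b = sha256_bits("HELLP")  # one character changed
flipped = sum(x != y for x, y in zip(a, b))
print(flipped / 256)  # typically close to 0.5
```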

Double Hashing for Enhanced Entropy

CardinalityKit uses double hashing to increase entropy and improve uniformity:

```python
import hashlib

def _hash_function(self, data):
    # Double hashing for increased entropy
    inner_hash = hashlib.sha256(str(data).encode('utf-8')).hexdigest()
    outer_hash = hashlib.sha256(inner_hash.encode('utf-8')).hexdigest()
    # Convert to 256-bit binary with zero padding
    return '{:0256b}'.format(int(outer_hash, 16))
```

Binary Conversion Mathematics

Binary Conversion Formula:

Quotient₁ = ⌊N/2⌋, Remainder₁ = N mod 2

Quotient₂ = ⌊Quotient₁/2⌋, Remainder₂ = Quotient₁ mod 2

...

Binary = (Remainderₖ, Remainderₖ₋₁, ..., Remainder₂, Remainder₁)₂

Example: Converting 72 to Binary

  1. 72 ÷ 2 = 36, remainder 0
  2. 36 ÷ 2 = 18, remainder 0
  3. 18 ÷ 2 = 9, remainder 0
  4. 9 ÷ 2 = 4, remainder 1
  5. 4 ÷ 2 = 2, remainder 0
  6. 2 ÷ 2 = 1, remainder 0
  7. 1 ÷ 2 = 0, remainder 1

Result: 01001000 (reading remainders bottom to top)
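The repeated-division procedure translates directly into code (an illustrative sketch, independent of any library):

```python
def to_binary(n, width=8):
    # Repeated division by 2; remainders read bottom-to-top give the bits
    remainders = []
    while n > 0:
        remainders.append(n % 2)
        n //= 2
    bits = ''.join(str(r) for r in reversed(remainders))
    return bits.zfill(width)  # zero-pad to a fixed width

print(to_binary(72))  # 01001000
```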

Why Zero Counting Works

🎲 Coin Flip Analogy

HyperLogLog's zero counting is based on calculating how many coin flips are needed for a specific pattern. For three consecutive 'heads':

  • Probability of 'heads' = 0.5
  • Probability of 'HHH' = 0.5 × 0.5 × 0.5 = 0.125
  • Expected number of three-flip trials needed = 1/0.125 = 8

This geometric distribution principle underlies cardinality estimation through zero counting.

Geometric Distribution Mean:

E[X] = 1/p

where p is the probability of success
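The E[X] = 1/p relationship can be verified by simulation: treat each group of three flips as one trial and count trials until 'HHH' appears. A sketch using Python's random module (the seed and trial count are arbitrary choices):

```python
import random

def trials_until_hhh(rng):
    # Each trial is three independent fair coin flips; succeed on HHH
    trials = 1
    while not all(rng.random() < 0.5 for _ in range(3)):
        trials += 1
    return trials

rng = random.Random(42)  # fixed seed for reproducibility
mean = sum(trials_until_hhh(rng) for _ in range(100_000)) / 100_000
print(mean)  # close to the theoretical 1/0.125 = 8
```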

From Hashing to Sketches

The uniform hashing properties enable us to:

  1. Bucket Assignment: Use first k bits to assign elements to 2^k buckets
  2. Rank Calculation: Find the position of the first 1 bit from the right (trailing zeros + 1) in the remaining bits
  3. Cardinality Estimation: Apply mathematical formulas based on bucket statistics
  4. Bias Correction: Adjust estimates using algorithm-specific factors

```python
# Example: extracting bucket and rank from a hash
def process_hash(hash_binary, k):
    # First k bits determine the bucket
    bucket = int(hash_binary[:k], 2)
    # Remaining bits are used for rank calculation
    remaining_bits = hash_binary[k:]
    # Rank = trailing zeros + 1 (position of the first 1 bit from the right)
    rank = 1
    for bit in reversed(remaining_bits):
        if bit == "0":
            rank += 1
        else:
            break
    return bucket, rank
```
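Putting the pieces together, the whole pipeline — double SHA-256 hashing, 256-bit binary conversion, then bucket and rank extraction — can be sketched end to end (illustrative; the function names here are mine, not CardinalityKit's API):

```python
import hashlib

def hash_to_bits(data):
    # Double hashing for extra entropy, then a 256-bit binary string
    inner = hashlib.sha256(str(data).encode('utf-8')).hexdigest()
    outer = hashlib.sha256(inner.encode('utf-8')).hexdigest()
    return '{:0256b}'.format(int(outer, 16))

def process_hash(hash_binary, k):
    bucket = int(hash_binary[:k], 2)  # first k bits -> bucket
    remaining = hash_binary[k:]
    rank = 1                          # position of first 1 bit from the right
    for bit in reversed(remaining):
        if bit == '0':
            rank += 1
        else:
            break
    return bucket, rank

bucket, rank = process_hash(hash_to_bits("HELLO"), k=4)
print(bucket, rank)  # bucket in 0..15, rank >= 1
```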

Key Takeaways

  • Uniform Distribution: Critical for accurate probabilistic estimates
  • Double Hashing: Increases entropy and reduces bias
  • Bit Significance: Under uniformity, each bit position splits the hash space in half; zero counting reads the low-order (rightmost) bits
  • Zero Counting: Based on geometric distribution principles
  • Scalability: Memory usage scales as log(log(N)) vs linear for exact counting
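The scalability claim can be made concrete with back-of-the-envelope arithmetic (illustrative figures, not measurements): each HLL register only needs enough bits to store a rank, roughly log₂(log₂(N)) plus a small constant, while exact counting must store every distinct element:

```python
N = 10**9                # one billion distinct 64-bit identifiers
exact_bytes = N * 8      # exact counting: store every element (8 bytes each)
m = 2**14                # 16,384 HLL buckets (k = 14), an illustrative choice
bits_per_register = 6    # enough to hold ranks up to 64 for 64-bit hashes
hll_bytes = m * bits_per_register / 8

print(exact_bytes)  # 8000000000 bytes (~8 GB)
print(hll_bytes)    # 12288.0 bytes (~12 KB)
```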