The foundation of cardinality estimation methodology lies in hashing. A good hash function maps input data uniformly across a large binary space, and that uniformity is what makes probabilistic counting possible.
Think of hashing as pointing binoculars at a seemingly starless patch of sky and discovering it is full of stars. In the same way, hashing takes strings with no intrinsic numeric meaning and spreads them uniformly across the hash space, allowing us to count distinct elements probabilistically.
The hash output should be uniformly distributed across the entire hash space to minimize collisions and ensure accurate probabilistic estimates.
The most significant bits are on the left, the least significant on the right. Because the output is uniform, each successive bit divides the hash space into two equal parts.
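This halving property can be sanity-checked empirically. The sketch below is an illustration, not the library's actual hash; it uses SHA-256 truncated to 64 bits and measures how often the leading bit is set across many inputs:

```python
import hashlib

def hash64(s: str) -> int:
    # Illustrative 64-bit hash: the first 8 bytes of SHA-256.
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

# A uniform hash sets each bit with probability ~1/2, so any single
# bit position splits the hash space into two equal halves.
n = 10_000
ones = sum((hash64(str(i)) >> 63) & 1 for i in range(n))
print(ones / n)  # close to 0.5
```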
Step 1: ASCII Conversion
Convert each character of the input string "HELLO" to its 8-bit ASCII code.
Result: 01001000 01000101 01001100 01001100 01001111
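Step 1 can be reproduced in a couple of lines of Python (the helper name to_ascii_bits is mine, for illustration):

```python
def to_ascii_bits(s: str) -> str:
    # Each character becomes its 8-bit ASCII code.
    return " ".join(format(ord(c), "08b") for c in s)

print(to_ascii_bits("HELLO"))
# 01001000 01000101 01001100 01001100 01001111
```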
Desirable Hash Function Properties:
Same input always produces same hash value
Different inputs rarely produce the same hash value (collisions are possible but unlikely)
Computationally efficient for large datasets
Hash output uniformly distributed across hash space
Difficult to find two inputs with same hash
Small input changes cause large hash differences
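Two of these properties, determinism and the avalanche effect, are easy to demonstrate with an off-the-shelf hash such as SHA-256 (used here purely as an example):

```python
import hashlib

def h(s: str) -> int:
    # 256-bit hash as an integer, for easy bitwise comparison.
    return int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")

# Determinism: the same input always produces the same hash.
assert h("HELLO") == h("HELLO")

# Avalanche: changing a single character flips many of the 256 output bits.
diff_bits = bin(h("HELLO") ^ h("HELLP")).count("1")
print(diff_bits)  # typically near half of 256
```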
CardinalityKit uses double hashing to increase entropy and improve uniformity:
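CardinalityKit's exact construction isn't shown in this section; as an assumption, the sketch below uses a common double-hashing scheme that derives a whole family of hash functions from two base hashes via hᵢ(x) = h₁(x) + i·h₂(x) mod 2⁶⁴:

```python
import hashlib

M = 1 << 64  # assumed 64-bit hash space, for illustration

def h1(s: str) -> int:
    # First base hash: first 8 bytes of SHA-256.
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def h2(s: str) -> int:
    # Independent second hash, forced odd so successive i values never repeat.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big") | 1

def double_hash(s: str, i: int) -> int:
    # i-th member of the derived hash family.
    return (h1(s) + i * h2(s)) % M
```

Mixing two independent hashes spreads values more evenly than either one alone, which is the entropy and uniformity gain the text refers to.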
Quotient₁ = ⌊N/2⌋, Remainder₁ = N mod 2
Quotient₂ = ⌊Quotient₁/2⌋, Remainder₂ = Quotient₁ mod 2
...
Binary = (Remainderₖ, Remainderₖ₋₁, ..., Remainder₂, Remainder₁)₂
Result for N = 72 (ASCII 'H'): 01001000 (reading remainders bottom to top)
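The repeated-division procedure above translates directly to code (to_binary is a hypothetical helper name):

```python
def to_binary(n: int, width: int = 8) -> str:
    # Divide by 2 repeatedly; the remainders, read last-to-first, are the bits.
    bits = []
    while n > 0:
        n, r = divmod(n, 2)
        bits.append(str(r))
    return "".join(reversed(bits)).zfill(width)

print(to_binary(72))  # 01001000, the ASCII code of 'H'
```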
HyperLogLog's zero counting is based on asking how many coin flips are needed, on average, before a specific pattern appears. The number of independent attempts until the first success follows a geometric distribution with expectation
E[X] = 1/p
where p is the probability of success. For three consecutive 'heads', p = (1/2)³ = 1/8, so on average 8 attempts are needed. This geometric distribution principle underlies cardinality estimation through zero counting: a run of k leading zeros in a uniform hash occurs with probability 2⁻ᵏ, so observing such a run suggests roughly 2ᵏ distinct elements were hashed.
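The expectation E[X] = 1/p = 8 for three consecutive heads (treating one attempt as three flips, with success probability 1/8) can be checked with a quick seeded simulation:

```python
import random

def attempts_until_three_heads(rng: random.Random) -> int:
    # One attempt = three independent fair coin flips; success = all heads.
    attempts = 1
    while not all(rng.random() < 0.5 for _ in range(3)):
        attempts += 1
    return attempts

rng = random.Random(42)
trials = 100_000
mean = sum(attempts_until_three_heads(rng) for _ in range(trials)) / trials
print(round(mean, 2))  # close to 1/p = 8
```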
The uniform hashing properties enable us to: