[Cryptography] Bloom filter question

Mon Aug 4 01:06:52 EDT 2025

> [I] make an MD5 hash of an address, break that up into 16 bytes,
> and use the 7 low bits of each byte as the bit index.  Is it reasonable to use
> the 16 bytes of an MD5 hash as 16 different mini-hashes [for a Bloom filter]?
Let’s ask the question another way:  Suppose I had 15 of the bytes of the MD5 hash of an unknown value.  Would I have any significant advantage in predicting any bit of the 16th?  As far as I know, no one has ever demonstrated that - the breaks are pre-image breaks of specialized sorts.  And since you are discarding 15 bits you’re making the prediction problem harder anyway.  (You’re making the pre-image problem easier of course.)

In any case, the basic properties of a Bloom index shouldn’t depend too strongly on powerful notions of independence of the hashes.  I don’t recall ever seeing anything beyond some hand-wavy statements about using “different” hashes.

Absent a deliberate attack - made way harder given the syntactic constraints on email addresses and their shortness - this seems like a valid approach to me.  If you can apply this after some normalization of the addresses - e.g., fixing the case after the “@“ - so much the better in making attacks even less plausible.

                                          -- Jerry