[Cryptography] Bloom filter question

Mon Aug 4 13:14:56 EDT 2025

It appears that Jerry Leichter <leichter at lrw.com> said:
>In any case, the basic properties of a Bloom index shouldn’t depend too strongly on powerful notions of independence of the hashes.  I don’t recall ever seeing anything beyond some hand-wavy
>statements about using “different” hashes.

The statistical analyses assume the hashes are independent, but I don't get the
impression that the results are a lot worse if they're only mostly independent.

>Absent a deliberate attack - made way harder given the syntactic constraints on email addresses and their shortness - this seems like a valid approach to me.  If you can apply this after some
>normalization of the addresses - e.g., fixing the case after the “@“ - so much the better in making attacks even less plausible.

Right. For ASCII addresses we'll probably put them in all lower case. For UTF-8
addresses, normalization is a huge can of worms (case folding is language and
country dependent, among other things) but we can burn that bridge when we get
to it.

There are faster hashes than MD5, like FNV, that aren't cryptographically secure
but for this application I don't think that matters.

Incidentally, someone sent me references that led to this 1970 paper which did
the same thing I proposed, make an MD5 hash and use chunks of it as the bit
indices:

https://pages.cs.wisc.edu/~cao/papers/summary-cache/node9.html

R's,
John