[Cryptography] Bloom filter question

Jon Callas jon at callas.org
Mon Aug 4 16:32:53 EDT 2025



> On Aug 2, 2025, at 12:04, John Levine <johnl at iecc.com> wrote:
> 
> For a mail security thing I am working on, I would like to encode the addresses
> the message was sent to as part of the signature in a way that makes it easy to
> check if an address is in the signature, but hard to recover the addresses from
> the signature. This seems like a job for a Bloom filter. Looking at the
> Wikipedia article, if I expect to have up to 10 addresses, it appears 128
> bits and 16 hash functions would be plenty. So I have to find 16 different hash
> functions.
> 
> So how about this: I make an MD5 hash of an address, break that up into 16 bytes,
> and use the 7 low bits of each byte as the bit index.  Is it reasonable to use
> the 16 bytes of an MD5 hash as 16 different mini-hashes?
> 
> I realize that MD5 is far from state of the art, but it is about twice as fast
> as SHA-2, and Bloom filters by their nature have occasional false positives,
> which is OK for this application. I don't think I care if someone can invent an
> implausible address that happens to hash the same as a real one.

Permit me a digression below; bottom line is that I think MD5 is okay for your use, and that SipHash might be better.

There's an underlying set of questions about what to use. The most important one is how much you care about duplicates (collisions to cryptographers). Some applications care a lot, like digital signatures, because that signature data object can be plucked from where it was created and reused to create verifiable lies.

On the other end of the scale, there's hash tables, where collisions are normal and expected. The hash table is a shortcut that trades off a computation to avoid linear searching. The hash is the index of which quote-quote data bucket to use, and duplicates are expected and normal. In an implementation, too many duplicates just leads to a rebuild of the whole thing -- add one more bucket and then re-bucket everything.

Flipping this on its head, you can think of a digital signature as a use of a hash where the bucket size must be one or zero. They're not buckets so much as pigeonholes, and if there are ever two pigeons in one hole, something's gone very wrong. The construction of the hash, therefore is complex because duplicates are a system failure.

Your use case, Bloom filters, are far more like hash tables than digital signatures. Duplicates are normal and the system compensates for that. MD5 will work because it's good enough. Moreover, SipHash, <https://en.wikipedia.org/wiki/SipHash>, will also work fine and was designed for precisely this case, where you don't need a heavyweight hash function because you don't care about its most important properties (like a lack of collisions and preimage resistance). It's also much faster than MD5, and yet secure enough that it handles issues like network packet authentication to stop DoS attacks.

You can use MD5. You can also use SipHash and it's better for your use case, anyway in terms of speed and algorithmic guarantees. There are even non-cryptographic hashes that would be good enough.

Lastly, there's another reason to use SipHash (or something else), and that is that there are people who will get triggered by MD5 and their brains will shut down. "MD5? Slowly, I turned, step by step..." They're going to tell you that you can't use MD5, it's not secure. They will not listen to you when you say that what you're really using it for is a sequence of 16-bit quantities that are mostly pseudorandom, and not anything else. And that 16-bit quantities are gonna have collisions because they're that small. Even if you don't dignify it with a response, people are going to squeal that it's bad. Alternatively, if you use SipHash (or others), you head that off. The limited claims SipHash makes are fine, they've probably never heard of it, and you will cut down the amount of BS you'll have to deal with by two to three orders of magnitude. Maybe four, even.

	Jon




More information about the cryptography mailing list