[Cryptography] Bloom filter question

Tue Aug 5 19:18:28 EDT 2025

It appears that Michael Kjörling <9bf3a7ef93bb at ewoof.net> said:
>On 4 Aug 2025 13:14 -0400, from johnl at iecc.com (John Levine):
>>> If you can apply this after some normalization of the addresses -
>>> e.g., fixing the case after the “@” - so much the better in making
>>> attacks even less plausible.
>> 
>> Right. For ASCII addresses we'll probably put them in all lower case. For UTF-8
>> addresses, normalization is a huge can of worms (case folding is language and
>> country dependent, among other things) but we can burn that bridge when we get
>> to it.
>
>Just use the Punycode encoding of the required domain name label(s)

There is a 1-1 mapping between Punycode A-labels and UTF-8 U-labels so either
works equally well. See the IDNA RFCs for more than you want to know about this
topic.
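
For anyone who wants to see the round trip, here is a minimal sketch. It
assumes Python's built-in "idna" codec, which implements the older IDNA 2003
rules rather than IDNA2008, so take it as an illustration and not as what we
will deploy. The helper names to_a_label and to_u_label are made up for this
example.

```python
# Sketch only: the stdlib "idna" codec follows IDNA 2003 (RFC 3490),
# not IDNA2008. Helper names are hypothetical.

def to_a_label(domain: str) -> str:
    """Convert a domain name to its ASCII (Punycode, xn--) form."""
    return domain.encode("idna").decode("ascii")


def to_u_label(domain: str) -> str:
    """Convert an ASCII (xn--) domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    a = to_a_label("bücher.example")
    print(a)              # xn--bcher-kva.example
    print(to_u_label(a))  # bücher.example
```

Either form can be the one that gets hashed, as long as everyone picks the
same one.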

The hard part is the mailbox, the part before the @ sign. In internationalized
mail it can in principle contain any Unicode code points, and there is no
agreement at all about which code points should be allowed in real addresses,
and what sort of normalizations to use (NFC and NFKC are both plausible) or if
and when to try case folding.
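
To make the disagreement concrete, here is a small sketch using Python's
standard unicodedata module. The function candidate_forms is hypothetical and
just lists a few plausible "canonical" forms of a mailbox; nothing here is a
recommendation about which one to use.

```python
import unicodedata


def candidate_forms(mailbox: str) -> dict[str, str]:
    """Return several plausible normalizations of a mailbox string.

    Hypothetical helper for illustration; no standard mandates any of these.
    """
    nfc = unicodedata.normalize("NFC", mailbox)
    nfkc = unicodedata.normalize("NFKC", mailbox)
    return {
        "NFC": nfc,
        "NFKC": nfkc,
        # Case folding can denormalize a string, so normalize again afterwards.
        "NFC+casefold": unicodedata.normalize("NFC", nfc.casefold()),
        "NFKC+casefold": unicodedata.normalize("NFKC", nfkc.casefold()),
    }


if __name__ == "__main__":
    # "e" + combining acute accent, the "fi" ligature (U+FB01), and "Straße"
    for s in ("e\u0301", "\ufb01x", "Straße"):
        print(repr(s), candidate_forms(s))
```

The "fi" ligature stays as it is under NFC but becomes the two letters "fi"
under NFKC, and "Straße" case-folds to "strasse", so the choice changes which
addresses count as equal.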

Not to be unduly snarky, but assigning semantics to Unicode strings and deciding
which ones are equivalent is an astonishingly complicated topic. It's safe to say
that if someone thinks he has a simple way to do it, that shows he doesn't
understand the problem. I sure don't think I have a good answer.

But anyway, like I said, we can leave the UTF-8 addresses for later because this
is not the only place where we run into the issue of whether two addresses are
"the same" for ill-defined versions of "same".

R's,
John