[Cryptography] canonicalizing unicode strings.

Jerry Leichter leichter at lrw.com
Wed Jan 17 11:29:49 EST 2018


>> I think the normal approach is to accept strings only in a single
>> script.  Mixed scripts are generally malicious in any sort of
>> identifier context.
> 
> Except that, for example, an email address may have a non-LATIN
> localpart alongside a LATIN (ASCII) domain name.  Or some of the
> labels of a domain may be in a different script than the parent
> domain.  (I have духовный.org, in which the first label is
> Cyrillic, and .org is of course Latin US-ASCII).
... and of course you then mix digits with letters from some script. That is common but not universal: some scripts have their own digit forms, which they may or may not use depending on context.
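
To make the per-label and digit points concrete, here is a rough Python sketch. It is only an illustration, not a real confusables or mixed-script check: the function names (rough_scripts, non_ascii_digits) are made up for this example, and since the standard unicodedata module does not expose the Unicode Script property, the first word of each character's name is used as a crude stand-in for its script. A production check would use the actual Script and Script_Extensions data.

```python
import unicodedata


def rough_scripts(label):
    """Guess the scripts of the letters in one label (hypothetical helper).

    Uses the first word of the Unicode character name ("LATIN",
    "CYRILLIC", ...) as an approximation of the Script property.
    Non-letters (digits, punctuation) are ignored.
    """
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def non_ascii_digits(text):
    """List (character, numeric value) for decimal digits outside ASCII."""
    return [
        (ch, unicodedata.decimal(ch))
        for ch in text
        if unicodedata.category(ch) == "Nd" and ord(ch) > 127
    ]


# Checked label by label, the mixed-script domain from the quoted text
# is unremarkable: each label is in a single script.
domain = "\u0434\u0443\u0445\u043e\u0432\u043d\u044b\u0439.org"
per_label = [rough_scripts(label) for label in domain.split(".")]

# A single label mixing Latin with a Cyrillic "a" (U+0430) is the
# suspicious case.
spoof = rough_scripts("p\u0430ypal")

# ASCII digits next to ARABIC-INDIC DIGIT THREE (U+0663).
digits = non_ascii_digits("abc123\u0663")
```

Even this toy version shows why "single script only" has to be applied per label rather than per string, and why digits need a separate rule from letters.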

Everything about natural languages is more complicated than you think.  The particular languages and representations you're familiar with - no matter how cosmopolitan you are - are pretty certain *not* to cover all the variations out there.
                                                        -- Jerry


