[Cryptography] canonicalizing unicode strings.

Wed Jan 31 11:49:21 EST 2018

On Wed, Jan 31, 2018 at 05:04:18PM +0800, jamesd at echeque.com wrote:
> On 31/01/2018 08:06, Nico Williams wrote:
> >Algorithms for detection of homoglyph identifiers that match existing
> >ones is a more urgent need.
>
> Attempts to restrict people to only using one script in an identifier are
> not going to fly,

That's why I did not propose that :)

>                   but if someone uses more than one script, we need to check
> against all potentially conflicting identifiers for homoglyphs.

And even if they use just one script.  Remember that 'l' and '1' look
similar in many fonts, and those are just in the plain old ASCII range.

Mixing Latin ligatures with Latin non-ligatures is not mixing scripts
either.

Diacritic marks can make confusable characters too.
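
To illustrate the point, here is a minimal Python sketch using only the
standard unicodedata module.  The helper name same_after_nfkc is mine,
purely for illustration.  It shows that ordinary Unicode normalization
(NFKC here) folds a Latin ligature and unifies composed and decomposed
accents, but does nothing for look-alikes such as 'l' and '1':

```python
import unicodedata


def same_after_nfkc(a: str, b: str) -> bool:
    """Return True if a and b are identical after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# U+FB01 LATIN SMALL LIGATURE FI folds to the two letters "fi".
print(same_after_nfkc("\ufb01le", "file"))

# Precomposed e-acute (U+00E9) and "e" + U+0301 COMBINING ACUTE ACCENT
# are canonically equivalent, so normalization unifies them too.
print(same_after_nfkc("caf\u00e9", "cafe\u0301"))

# Digit one and lowercase L are unrelated as far as normalization is
# concerned, even though many fonts render them almost identically.
print(same_after_nfkc("paypa1", "paypal"))
```

So normalization alone is not a homoglyph check; it only removes
differences in encoding, not differences between distinct characters
that happen to look alike.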

> To efficiently check for homoglyphic identifiers, have to canonicalize all
> homoglyphs -

That matters mostly for string hashing.  A hash admits collisions
anyway, so one can choose to be very liberal in "canonicalizing"
characters before hashing.  The biggest problem is that a hash kept in
persistent storage needs to be stable, but it can't be as long as the
set of confusable characters keeps changing, so storage formats need to
be able to version their string hashing.
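
A rough sketch of what versioned string hashing could look like, in
Python.  Everything named here (FOLD_TABLES, fold, versioned_hash) is
hypothetical, and the fold tables are deliberately tiny stand-ins rather
than real confusables data.  The idea is only that the stored value
carries the version of the folding rules that produced it:

```python
import hashlib
import unicodedata

# Hypothetical fold tables, one per version of the folding rules.
# Version 2 adds a mapping that version 1 did not know about.
FOLD_TABLES = {
    1: {"1": "l"},
    2: {"1": "l", "\u0430": "a"},  # U+0430 CYRILLIC SMALL LETTER A
}


def fold(s: str, version: int) -> str:
    """Liberally canonicalize s using the fold table for this version."""
    table = FOLD_TABLES[version]
    s = unicodedata.normalize("NFKC", s)
    return "".join(table.get(ch, ch) for ch in s)


def versioned_hash(s: str, version: int) -> tuple[int, str]:
    """Return (version, digest) so the stored hash records its rules."""
    digest = hashlib.sha256(fold(s, version).encode("utf-8")).hexdigest()
    return version, digest
```

With this, "paypa1" and "paypal" collide under either version, while a
string using the Cyrillic letter collides with "paypal" only under
version 2.  A store that recorded version 1 hashes can tell that they
need recomputing rather than silently comparing incompatible digests.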

> And no official list of homoglyphs, or official software to canonicalize
> them.

There is, actually.  See UTS#39.  I expect the confusables.txt file to
grow over time, not least because Unicode is not closed to new scripts.
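
For a feel of how that data gets used, here is a simplified Python
sketch of the UTS#39 "skeleton" comparison: normalize to NFD, replace
each character with its prototype from confusables.txt, and normalize
again.  The function names are mine, the SAMPLE lines are written in the
file's "source ; target ; type # comment" layout for illustration rather
than quoted from it, and the sketch leaves out other details of the
specification, so treat it as an approximation:

```python
import unicodedata

# Illustrative lines in the confusables.txt layout (hex code points).
SAMPLE = """\
# source ; target ; type # comment
0031 ; 006C ; MA # DIGIT ONE -> LATIN SMALL LETTER L
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
"""


def parse_confusables(text: str) -> dict[str, str]:
    """Build a source-character -> prototype-string table."""
    table = {}
    for line in text.splitlines():
        # Drop a leading byte order mark, then comments and blank lines.
        line = line.lstrip("\ufeff").split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        source = chr(int(fields[0], 16))
        # The target may be a sequence of several code points.
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def skeleton(s: str, table: dict[str, str]) -> str:
    """Simplified skeleton: NFD, map to prototypes, NFD again."""
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(ch, ch) for ch in s)
    return unicodedata.normalize("NFD", s)


def confusable(a: str, b: str, table: dict[str, str]) -> bool:
    """Two strings are treated as confusable if their skeletons match."""
    return skeleton(a, table) == skeleton(b, table)
```

Because the table is just data, picking up a newer confusables.txt means
re-parsing the file, which is also why anything derived from it (such as
a stored hash) has to record which version it was built from.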

Nico
--