[Cryptography] canonicalizing unicode strings.

Nico Williams nico at cryptonector.com
Tue Jan 30 19:06:26 EST 2018


On Tue, Jan 30, 2018 at 12:18:08PM -0500, John Levine wrote:
> Unicode is a language for typesetting.  Their goal is that any text in

Well, a small portion of typesetting (there's no way to format text, no
way to control positioning on a page, etc.).

> any living language and most dead ones can be represented as Unicode
> characters, and any computer system with suitable fonts and rendering
> software can display that text.  But it is a non-goal that there be a
> unique Unicode representation of any particular text.  Unicode is full
> of homoglyphs, different characters or character groups that display
> exactly the same way, and semantic homoglyphs, characters that don't
> look the same but that readers would consider equivalent, e.g.,
> traditional and simplified Chinese.  You can go down endless ratholes
> for languages such as Serbian and Belarusian that are written in both
> Latin and Cyrilic, and whether the same word written in different
> scripts is "the same".

It was never going to be any other way.  This isn't a result of design
choices that went into Unicode; it's a result of human scripts being...
as they are: organically evolved, and plentiful.  Perhaps some design
choices could have reduced some of these homoglyph cases, but there
would have been annoying trade-offs anyway.

We had homoglyph problems before computers ('l' vs '1', for example).
They got worse not because of Unicode, but because the world is more
connected now.  Yes, we could have had, e.g., a unified CJK codepoint
assignment set, but it turns out people didn't want that.

Basically, we just have to accept these issues and deal with them as
best we can: with code to heuristically detect phishing based on
homoglyphs, and code to fuzzily match Unicode identifiers.  Such code
has to evolve as scripts are added to Unicode and/or new homoglyph sets
are discovered (if we don't already know all of them).
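As a sketch of what such fuzzy matching might look like: the usual
approach is to reduce each identifier to a "skeleton" so that homoglyph
variants collide.  The confusable table below is a tiny illustrative
sample, not real data; a serious implementation would use the Unicode
consortium's confusables.txt from UTS #39.

```python
import unicodedata

# Toy confusable table.  A real implementation would load the Unicode
# consortium's confusables.txt (UTS #39); these entries are illustrative.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "1": "l",       # the classic pre-Unicode homoglyph
    "0": "o",
}

def skeleton(name: str) -> str:
    """Reduce a name to a skeleton so homoglyph variants compare equal."""
    folded = unicodedata.normalize("NFKC", name.casefold())
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)

# "paypal" spelled with Cyrillic letters collides with the ASCII spelling:
assert skeleton("p\u0430yp\u0430l") == skeleton("paypal")
```

Comparing skeletons (rather than raw strings) is how one would flag a
new username as confusable with an existing one, and the table has to
grow as new homoglyph sets turn up, per the above.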

> This makes Unicode a lousy base to use for identifiers and passwords,
> but it's what the world uses, so we make the best of it.  The Unicode

No alternative to Unicode that somehow supports multiple scripts could
have been a very good choice for expressing identifiers.  And since
Unicode exists and is so widely used now, even if there were such an
alternative, we'd have to convert to/from it, and thus we'd only just
barely have moved the boundary at which these problems arise.

There's no point saying that this "makes Unicode a lousy base to use for
identifiers and passwords".

And for passwords homoglyphs are mostly a non-issue.  The primary issue
for passwords is the user's ability to enter them correctly on all their
devices.  And, of course, normalization is kinda required, since the
user generally has no control over pre-composition choices of the input
method.
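To illustrate why normalization is required there: the same visible
password can arrive from different input methods as different codepoint
sequences, and normalizing (NFC here, as one reasonable choice) before
hashing makes them compare equal.

```python
import unicodedata

# "café" typed on two devices: one input method emits precomposed
# U+00E9, the other emits 'e' followed by COMBINING ACUTE ACCENT.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# Byte-for-byte they differ, so a naive password hash comparison fails:
assert composed != decomposed

# Normalizing both sides (NFC) before hashing makes them agree:
assert unicodedata.normalize("NFC", composed) == \
       unicodedata.normalize("NFC", decomposed)
```

Whether to pick NFC or NFKC (and where in the stack to normalize) is a
protocol design choice; the point is only that the user can't control
which form their keyboard emits.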

> consortium and the IETF and ICANN have been working for over a decade
> to try and figure out how to define usable Unicode subsets for
> identifiers.  It's really hard, and I doubt anyone here will have any
> deep insights that a dozen people other places haven't already had.

I'm not sure it's worthwhile to do this, and I think I'd likely object.

In gaming and social media (e.g., reddit) people like to use identifiers
that one would never see used in, e.g., corporate networks.  I've
objected before to SASL proscribing a variety of characters that people
like to use in gaming: why should we say no to using, say, '*', in
gaming usernames[0]?  But homoglyph usernames are a problem, if nothing
else for phishing / impersonation reasons.

Algorithms for detecting homoglyph identifiers that match existing ones
are a more urgent need.

Nico

[0] Or ':'.  Yes, ':' is very very bad for Unix user/group names, but
    for SASL cids, who cares?
