[Cryptography] canonicalizing unicode strings.

Thu Feb 1 17:10:21 EST 2018

On Thu, Feb 01, 2018 at 03:16:21PM -0500, John Levine wrote:
> In article <20180201170839.GL10954 at localhost> you write:
> >The point I was making is that mixed scripts are fine for identifiers,
> >but that one must disallow subsequent new identifiers that look the same
> >as existing identifiers (unless they are aliases of those).
> 
> Well, maybe.  Some scripts are mixed in the real world, e.g. kanji,
> hiragana, katakana, latin, are mixed all the time in Japan and people
> know how to deal with them.  But Indic scripts like Devanagari have
> extremely complex layout rules, and I would not want to guess what
> something which was half Devanagari and half Arabic would look like, or
> how I would enter it.

But usage evolves.  So, for example, in Korea adding a Latin "ing"
suffix is a common thing now.  Suppose that became popular among
Devanagari users?  Or suppose they mix Latin digits into otherwise
Devanagari text.

And while I have no idea what "half Devanagari and half Arabic would
look like", I'm guessing that's not such a strange idea given that India
is quite the cultural melting pot.

> Also keep in mind that for anything much more complex than accented
> Latin, the way you enter something on a keyboard and the Unicode that
> you end up with often have little or nothing to do with each other.

Yeah, I know.  I assume a lot of script mixing involving two or more
non-Latin scripts is... difficult to enter (except by pasting :).

> For Chinese, people type ASCII pinyin and the input method picks the
> characters, choosing among homophones by context or by popping up a
> menu and asking.  (This includes people who know no English, but they

That's how people enter Kanji, only they type in hiragana instead of
romaji.

> know enough of the sounds to type the pinyin.)  If my identifier or
> password was half Chinese and half something else, I doubt that I
> would enter it the same way on my laptop and my phone, or on an
> Android and an iPhone, which would make it a pretty poor identifier.

Correct.  Without being able to view what you're entering, password
entry can be very difficult in some scripts (though if the input method
is predictable, maybe you can just use muscle memory...).

None of this means that one should reject new mixed-script passwords.
However, users should be warned about the difficulty of password entry.
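
A rough sketch of what "warn, don't reject" could look like, in Python.
The function names are made up for illustration, and the script
detection is only a heuristic: Python's standard library does not expose
the Unicode Script property, so this takes the first word of each
letter's character name ("LATIN", "DEVANAGARI", "CJK", ...) as a
stand-in.  Real code would want proper script data.

```python
import unicodedata

def scripts_used(text):
    """Return a rough set of script labels for the letters in text.

    Heuristic only: the first word of the Unicode character name is used
    in place of the real Script property.
    """
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation and combining marks are skipped
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts

def password_warning(password):
    """Return a warning string for mixed-script passwords, else None."""
    found = scripts_used(password)
    if len(found) > 1:
        return ("mixed scripts (%s): this may be hard to enter on "
                "other devices" % ", ".join(sorted(found)))
    return None  # single script (or no letters): nothing to warn about
```

Note that the password is accepted either way; the return value is only
something to show the user.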

And back to identifiers, again, we (for a value of "we" that roughly
means "IETF" here) shouldn't forbid script mixing for them either.
However, administrators certainly could in specific contexts.
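
To make the earlier point about look-alike identifiers concrete, here is
a minimal sketch, again in Python.  It allows mixed scripts but refuses a
new identifier whose "skeleton" collides with one already registered.
The CONFUSABLES table is a tiny hand-written illustration, not real
data (a deployment would presumably load the confusables data published
with UTS #39), and Registry and skeleton() are hypothetical names.
Aliases are not handled here.

```python
import unicodedata

# Illustrative only: maps a few Cyrillic letters to their Latin look-alikes.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}

def skeleton(identifier):
    """Map an identifier to a key shared by its visual look-alikes."""
    nfd = unicodedata.normalize("NFD", identifier)
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", mapped)

class Registry:
    """Accepts any script mix, but only one identifier per skeleton."""

    def __init__(self):
        self._taken = {}  # skeleton -> identifier as first registered

    def register(self, identifier):
        key = skeleton(identifier)
        if key in self._taken:
            return False  # same as, or looks like, an existing identifier
        self._taken[key] = unicodedata.normalize("NFC", identifier)
        return True
```

So registering "paypal" and then the same string with Cyrillic "а" in
place of Latin "a" would fail on the second call, while an identifier
mixing Devanagari and Latin letters would still be accepted.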

> This really is a well studied topic.  Anyone who cares about it should
> go do some homework and let us know how people who've dealt with this
> problem in the past have done.  I know what the answer for the DNS is,
> NFC and IDNA and scripts and language tables, but the DNS is an
> unusually sparse kind of identifier.

+1
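
For anyone following along who hasn't looked at NFC: a small Python
illustration of the normalization step mentioned above, using the
standard unicodedata module.  The helper name canonical() is just for
this example, and this covers only normalization, not IDNA or the
script and language tables.

```python
import unicodedata

def canonical(s):
    """Return the NFC (canonically composed) form of s."""
    return unicodedata.normalize("NFC", s)

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# The two strings render the same but differ as code point sequences...
assert composed != decomposed
# ...and compare equal once both are normalized to NFC.
assert canonical(composed) == canonical(decomposed)
```

Comparing or hashing identifiers without a step like this means two
strings that look identical can be treated as different.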