[Cryptography] canonicalizing unicode strings.

John Levine johnl at iecc.com
Thu Feb 1 15:16:21 EST 2018


In article <20180201170839.GL10954 at localhost> you write:
>The point I was making is that mixed scripts are fine for identifiers,
>but that one must disallow subsequent new identifiers that look the same
>as existing identifiers (unless they are aliases of those).

Well, maybe.  Some scripts are mixed in the real world; kanji,
hiragana, katakana, and Latin are mixed all the time in Japan, and
people know how to deal with them.  But Indic scripts like Devanagari
have extremely complex layout rules, and I would not want to guess
what something that was half Devanagari and half Arabic would look
like, or how I would enter it.
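
To make the mixed-script point concrete, here is a rough sketch in
Python, mine only and purely for illustration, that guesses which
scripts appear in a string from the first word of each character's
Unicode name.  A real check would use the Script property from
Scripts.txt or the UTS #39 confusables data, which the standard
library does not expose:

    import unicodedata

    def rough_scripts(s):
        """Crude per-character script guess: first word of the Unicode
        character name.  Good enough to show a mix, not for production."""
        scripts = set()
        for ch in s:
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
        return scripts

    # A Devanagari/Latin mix reports more than one script:
    print(rough_scripts("\u0928\u092e\u0938\u094d\u0924\u0947hello"))
    # {'DEVANAGARI', 'LATIN'}

A conservative registry could simply refuse any new identifier for
which that set has more than one member, outside a short list of
combinations (like the Japanese one above) known to be safe.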

Also keep in mind that for anything much more complex than accented
latin, the way you enter something on a keyboard and the Unicode that
you end up with often have little or nothing to do with each other.
For Chinese, people type ASCII pinyin and the input method picks the
characters, choosing among homophones by context or by popping up a
menu and asking.  (This includes people who know no English, but they
know enough of the sounds to type the pinyin.)  If my identifier or
password were half Chinese and half something else, I doubt that I
would enter it the same way on my laptop and my phone, or on an
Android and an iPhone, which would make it a pretty poor identifier.
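
Even plain accented Latin shows the keyboard/code-point split, which
is the part canonicalization actually fixes.  A minimal sketch, using
only the standard library's unicodedata module; NFC collapses the
composed/decomposed difference, but nothing like it can make two
different hanzi chosen by an input method compare equal:

    import unicodedata

    # One keyboard hands you a single code point, another hands you
    # "e" plus a combining accent, for the same visible string:
    composed   = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed = "cafe\u0301"     # "e" + U+0301 COMBINING ACUTE ACCENT

    print(composed == decomposed)                        # False
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))      # True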

This really is a well-studied topic.  Anyone who cares about it should
go do some homework and let us know how the people who've dealt with
this problem in the past solved it.  I know what the answer for the
DNS is: NFC and IDNA and scripts and language tables.  But the DNS is
an unusually sparse kind of identifier.
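
For the curious, a minimal sketch of that DNS pipeline, assuming
Python's standard library: unicodedata.normalize for the NFC step,
then the built-in "idna" codec for ToASCII.  The codec implements the
older IDNA2003 rules; real registries layer IDNA2008 plus their
per-script and per-language tables on top, so treat this as the
skeleton only:

    import unicodedata

    def to_dns_name(s):
        """NFC-normalize, then apply the IDNA ToASCII step per label."""
        nfc = unicodedata.normalize("NFC", s)
        return nfc.encode("idna").decode("ascii")

    print(to_dns_name("b\u00fccher.example"))   # xn--bcher-kva.example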

R's,
John


