[Cryptography] canonicalizing unicode strings.

John Levine johnl at iecc.com
Tue Jan 30 12:18:08 EST 2018


In article <b778ee29-b06b-e441-b4b6-ded040b730da at echeque.com> you write:
>>> What, however, is a script?

>Glancing at those details, I am pretty sure that there are always 
>legitimate reasons to mix Latin script characters with any other script, 
>or imperial Aramaic with Aramaic, etc.
>
>The attribution of characters to a particular script is acknowledged to 
>be substantially artificial, capricious, uncertain, and arbitrary.

No sh*t, Sherlock.

Unicode is a language for typesetting.  Its goal is that any text in
any living language and most dead ones can be represented as Unicode
characters, and any computer system with suitable fonts and rendering
software can display that text.  But it is a non-goal that there be a
unique Unicode representation of any particular text.  Unicode is full
of homoglyphs, different characters or character groups that display
exactly the same way, and semantic homoglyphs, characters that don't
look the same but that readers would consider equivalent, e.g.,
traditional and simplified Chinese.  You can go down endless ratholes
for languages such as Serbian and Belarusian that are written in both
Latin and Cyrillic, and whether the same word written in different
scripts is "the same".
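To make the homoglyph point concrete, here's a small sketch using
Python's standard `unicodedata` module.  Latin "a" and Cyrillic "а"
render identically in most fonts, yet they are distinct code points,
and even the most aggressive standard normalization form (NFKC) does
not unify them:

```python
import unicodedata

latin_a = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A

# Visually identical in most fonts, but different characters.
assert latin_a != cyrillic_a

# NFKC normalization leaves each character in its own script;
# it never maps a Cyrillic letter onto its Latin lookalike.
assert unicodedata.normalize("NFKC", cyrillic_a) == cyrillic_a
assert unicodedata.normalize("NFKC", latin_a) != \
       unicodedata.normalize("NFKC", cyrillic_a)

print(unicodedata.name(latin_a))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A
```

So normalization alone cannot canonicalize away homoglyphs; that's a
separate, much harder problem.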

This makes Unicode a lousy base to use for identifiers and passwords,
but it's what the world uses, so we make the best of it.  The Unicode
consortium and the IETF and ICANN have been working for over a decade
to try and figure out how to define usable Unicode subsets for
identifiers.  It's really hard, and I doubt anyone here will have any
deep insights that a dozen people in other places haven't already had.
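To illustrate why this is hard, here's a deliberately crude sketch of a
mixed-script check, the kind of heuristic registries and browsers layer
on top of Unicode.  It is only an approximation: the stdlib doesn't
expose the real Unicode Script property, so this guesses the script
from the first word of each character's name, and real implementations
use the Unicode consortium's confusables data instead.  The function
names here are my own, not from any standard library:

```python
import unicodedata

def scripts_in(s):
    """Approximate the set of scripts used in s by taking the first
    word of each character's Unicode name (e.g. 'LATIN', 'CYRILLIC').
    Crude: not the real Script property, just an illustration."""
    found = set()
    for ch in s:
        name = unicodedata.name(ch, "")
        if name:
            found.add(name.split()[0])
    return found

def looks_mixed(label):
    """Flag labels mixing Latin with Cyrillic or Greek, the classic
    homoglyph-attack combination in domain names."""
    s = scripts_in(label)
    return bool("LATIN" in s and {"CYRILLIC", "GREEK"} & s)

# "pаypal" with a Cyrillic "а" (U+0430) is flagged;
# all-Latin "paypal" is not.
print(looks_mixed("p\u0430ypal"))  # True
print(looks_mixed("paypal"))       # False
```

Even this toy check shows the tension: it would also reject the
legitimate Latin/Cyrillic mixtures mentioned above, which is exactly
the kind of judgment call that makes defining usable subsets so slow.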

I blogged on this topic a few days ago here:

https://jl.ly/Internet/uniid.html

R's,
John
