[Cryptography] Missing symbol annoyance with unicode technical standard.

Tue Dec 17 15:52:19 EST 2024

It appears that Ray Dillinger <bear at sonic.net> said:
>I have longstanding annoyance with the many, many different facets of 
>the Unicode standard's infamous, disastrous ability to form sequences of 
>codepoints that look exactly like other, different sequences of 
>codepoints. ...

In fairness, this is complaining that a screwdriver makes a lousy hammer.

Unicode's origins are in typography, where it has never been a problem that
there is more than one way to typeset a character or that semantically different
characters may look the same. Those of us of a certain age may remember manual
typewriters that only had digits 2 through 9 because you used lower case l and
capital O for 1 and 0.

Unicode's goal is to make it possible to typeset text in every language in the
world. They've gotten quite close to that, give or take novelty glyphs like your
lock, and an enormous backlog of obscure Chinese characters that they are slowly
working through.

Even in English there has never been a unique mapping from visual characters to
codes (see 1/l and 0/O, and m looks a lot like rn, and ...). It gets worse when
you add diacriticals, and worse again when you add other scripts like Cyrillic
and Greek, which have characters that are distinct from the Latin ones but look
identical.
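
To make that concrete, here is a minimal sketch in Python using the standard
unicodedata module. The variable names and the helper same_after_nfc are mine,
made up for illustration. It shows that normalization reconciles the two
spellings of an accented letter, but does nothing for look-alikes across
scripts.

```python
import unicodedata

# Two spellings of the same accented letter: one precomposed code point,
# versus a base letter followed by a combining acute accent.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

# Two different letters that most fonts draw identically.
latin_a = "a"            # LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # CYRILLIC SMALL LETTER A


def same_after_nfc(s, t):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", s) == unicodedata.normalize("NFC", t)


print(composed == decomposed)                 # False: different code points
print(same_after_nfc(composed, decomposed))   # True: NFC makes them equal
print(same_after_nfc(latin_a, cyrillic_a))    # False: still distinct letters
print(unicodedata.name(cyrillic_a))
```

Normalization only handles the first case. The second needs something like a
confusables table and a policy about mixing scripts, which is a much harder
problem.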

This is not Unicode's fault. It is our fault for expecting Unicode to be
something it is not. The reason we chose Unicode is the usual one: it is the
worst option except for all of the others. The others are less complete, or
require fragile shift sequences, or are poorly defined, and often all three.

If you have a design that depends on being able to print or display something,
and then people type in the exact code points you used to display it, you are
always asking for trouble. It has sort of worked so far for domain names and
email addresses, but even there many people would be amazed at the fudges that go
on behind the scenes to try to paper over the ambiguities.

My point here is "don't do that." You can probably get adequate uniqueness if
you limit yourself to a small character set, with uppercase ASCII letters being
the usual example, but don't push your luck.
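
A sketch of what that limit might look like in Python, assuming you are
willing to accept lower case ASCII on input and fold it to upper case. The
function name canonical_code and that folding policy are my assumptions, not
anything standard.

```python
import string

# Accept only ASCII letters on input. The test is made BEFORE case folding,
# because str.upper() maps some non-ASCII letters to ASCII (for example
# "\u00df".upper() is "SS"), which would quietly let them through.
ACCEPTED_INPUT = frozenset(string.ascii_letters)


def canonical_code(typed):
    """Return the typed code as upper case ASCII, or raise ValueError."""
    cleaned = typed.strip()
    if not cleaned:
        raise ValueError("empty code")
    for ch in cleaned:
        if ch not in ACCEPTED_INPUT:
            raise ValueError("character not allowed: %r" % ch)
    return cleaned.upper()


print(canonical_code(" qrx "))  # QRX

try:
    canonical_code("\u0410BC")  # starts with CYRILLIC CAPITAL LETTER A
except ValueError as err:
    print("rejected:", err)
```

Rejecting outright rather than trying to map look-alikes back to ASCII keeps
the rule simple enough to explain to whoever has to type the thing in.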

R's,
John