[Cryptography] Missing symbol annoyance with unicode technical standard.

Ron Garret ron at flownet.com
Tue Dec 17 16:12:54 EST 2024



> On Dec 17, 2024, at 12:52 PM, John Levine <johnl at iecc.com> wrote:
> 
> It appears that Ray Dillinger <bear at sonic.net> said:
>> I have longstanding annoyance with the many, many different facets of 
>> the Unicode standard's infamous, disastrous ability to form sequences of 
>> codepoints that look exactly like other, different sequences of 
>> codepoints. ...
> 
> In fairness, this is complaining that a screwdriver makes a lousy hammer.
> 
> Unicode's origins are in typography, where it has never been a problem that
> there is more than one way to typeset a character or that semantically different
> characters may look the same. Those of us of a certain age may remember manual
> typewriters that only had digits 2 through 9 because you used lower case l and
> capital O for the other two.
> 
> Unicode's goal is to make it possible to typeset text in every language in the
> world. They've gotten quite close to that, give or take novelty glyphs like your
> lock, and an enormous backlog of obscure Chinese characters that they are slowly
> working through.
> 
> Even in English there has never been a unique mapping from visual characters to
> codes (see 1/l and 0/O and m looks a lot like rn and ...) When you add
> diacriticals it gets worse, then add other scripts like Cyrillic and Greek,
> which have characters that are different from Latin but look identical.
> 
> This is not Unicode's fault. It is our fault for expecting Unicode to be
> something it is not. The reason we chose Unicode is the usual one, it is the
> worst option except for all of the others. The others are less complete or
> require fragile shift sequences, or are poorly defined, and often all three.
> 
> If you have a design that depends on being able to print or display something,
> and then people type in the exact code points you used to display it, you are
> always asking for trouble. It has sort of worked so far for domain names and
> email addresses but even there many people would be amazed at the fudges that go
> on behind the scenes to try and paper over the ambiguities.
> 
> My point here is "don't do that." You can probably get adequate uniqueness if
> you limit yourself to a small character set, with uppercase ASCII letters being
> the usual example, but don't push your luck.

This problem is not unique to typography.  The standard phonetic alphabet has two very similar-sounding words for 0 (zero) and S (sierra).  They look very different, but if you say them quickly over a noisy comm link they can become virtually indistinguishable.  And this is despite the fact that the phonetic alphabet was specifically designed to provide unambiguous comms.

And this problem exists in plain ascii too.  ln this paragraph there are four ascii characters that are not what you would expect from the context.  | chaIIenge you to find them all without doing a hexdump.
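
For what it's worth, one way to catch this kind of thing mechanically is to fold look-alike characters into a single representative and compare the results. The Python below is only a rough sketch: the fold table and the names skeleton and looks_like are made up for illustration, they cover just the pairs mentioned in this thread (1/l/I/|, 0/O, rn/m), and they are not taken from any standard.

```python
# Illustrative fold table (an assumption, not a standard): each look-alike
# ASCII character is mapped to one representative of its visual class.
_FOLD = str.maketrans({
    "I": "l",  # capital i
    "|": "l",  # vertical bar
    "1": "l",  # digit one
    "O": "0",  # capital o
})


def skeleton(text: str) -> str:
    """Fold look-alike characters, then collapse the 'rn' pair into 'm'."""
    return text.translate(_FOLD).replace("rn", "m")


def looks_like(a: str, b: str) -> bool:
    """True if a and b are different strings with the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)


if __name__ == "__main__":
    for a, b in [("Illinois", "lllinois"), ("modern", "modem"), ("cat", "cot")]:
        print(repr(a), repr(b), looks_like(a, b))
```

Two strings with the same skeleton are not necessarily indistinguishable in every font, and two with different skeletons may still be confusable, so this is a screening aid at best.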

rg


