[Cryptography] Source code that looks like completely different source code

John Levine johnl at taugh.com
Mon Dec 13 22:52:00 EST 2021


According to Ray Dillinger <bear at sonic.net>:
>https://trojansource.codes/trojan-source.pdf
>
>The skinny is that by abusing bidirectional control in unicode, hackers
>can create source code that [looks like one thing but does another.]

This reminds us why it is not a good idea to use a crescent wrench as
a hammer, even though you can often drive nails with one.

Unicode is a typesetting language, with the goal to represent all
written languages past and present. If you are old enough to remember
typesetting by hand, it is like a gigantic type case which you can use
to typeset any language. In typesetting, it is not a problem if there
is more than one way to set a chunk of text, and Unicode often has
lots of different ways to get the same visual characters. For example,
if you want lower case e with an acute accent, you can use a letter e
followed by an accent modifier which will appear over the e, or you
can use a "precomposed" e-acute character. In many typefaces, a Latin
O and a Greek Omicron look identical, which is not a problem in
typesetting. It is easy to come up with sequences of Unicode code
points that look confusing or stupid, particularly when some are laid
out left-to-right and others right-to-left. Again, this is not news to
a typesetter: just don't do that.
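
To make the two examples above concrete, here is a small Python sketch
using only the standard unicodedata module. The variable names are mine,
chosen for illustration. It shows that the two spellings of e-acute are
different code point sequences that normalization can reconcile, while
the Latin O and the Greek Omicron remain distinct characters even after
normalization.

```python
import unicodedata

# Two ways to typeset "e with acute accent".
decomposed = "e\u0301"    # letter e followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # single code point LATIN SMALL LETTER E WITH ACUTE

# As raw code point sequences the two strings differ.
different_code_points = decomposed != precomposed

# NFC normalization composes the pair into the precomposed form.
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Homographs are a different matter: normalization does not merge them.
latin_o = "O"             # U+004F LATIN CAPITAL LETTER O
greek_omicron = "\u039f"  # U+039F GREEK CAPITAL LETTER OMICRON
still_distinct = unicodedata.normalize("NFKC", greek_omicron) != latin_o

print(different_code_points, equal_after_nfc, still_distinct)
```

Normalization handles the "more than one way to set it" case, but it
does nothing for look-alikes from different scripts, which need a
separate policy.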

Unfortunately, Unicode also turns out to be the least bad option to
represent text in many programming contexts, even though those contexts
generally don't tolerate ambiguity and homographs. The IETF and ICANN
have been tearing their hair out for over a decade trying to come up
with profiles and rules to define subsets of Unicode that work in the
DNS, with some success but still a lot of holes and strange edge
cases.

So it is not surprising that allowing arbitrary, or even semi-arbitrary,
Unicode in programming languages has surprising side effects: once
again the typesetting crescent wrench meets the programming nail.*
If you recognize the problems of using typesetting tools in
programming applications, it's possible to mitigate the damage, e.g.
with RFC 8264 (the PRECIS framework), but first you have to realize
that things are not as simple as they used to be.
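
As one illustration of a mitigation (far narrower than the profiles in
RFC 8264, and not an implementation of them), a checker can simply flag
bidirectional control characters in source text. This is a minimal
Python sketch. The function name find_bidi_controls and the choice of
nine code points (the embedding, override, and isolate controls) are
assumptions made for the example.

```python
import unicodedata

# Bidirectional embedding, override, and isolate controls:
# U+202A..U+202E and U+2066..U+2069.
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)

def find_bidi_controls(text):
    """Return (line, column, name) for each bidi control found in text.

    Lines and columns are 1-based; columns count code points.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits

# Hypothetical snippet with controls hidden inside a string literal.
sample = "access = 'user\u202e \u2066# admin\u2069 \u2066'\nprint(access)\n"
for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {name}")
```

A check like this catches only the reordering trick. It does nothing
about homographs, which is part of why the general problem needs
profiles rather than a single filter.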

R's,
John

* - I admit this metaphor is getting tired.
-- 
Regards,
John Levine, johnl at taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


