[Cryptography] canonicalizing unicode strings.

Natanael natanael.l at gmail.com
Wed Jan 31 21:36:44 EST 2018


On 31 Jan. 2018 03:58, "Nico Williams" <nico at cryptonector.com> wrote:

We had homoglyph problems before computers ('l' vs '1', for example).
They got worse not because of Unicode, but because the world is more
connected now.  Yes, we could, e.g., have had a unified CJK codepoint
assignment set, but it turns out people didn't want that.

Basically, we just have to accept these issues and deal with them as
best we can: with code to heuristically detect phishing based on
homoglyphs, and code to fuzzily match Unicode identifiers.  Such code
has to evolve as scripts are added to Unicode and/or new homoglyph sets
are discovered (if we don't already know all of them).
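
(For concreteness: the kind of fuzzy identifier matching described above
could look roughly like the toy Python sketch below, in the spirit of the
"skeleton" transform from Unicode UTS #39. The confusables table here is a
tiny hand-picked stand-in for the real confusables.txt data.)

    # Toy skeleton-style fuzzy matching; CONFUSABLES is a stand-in for
    # the real UTS #39 confusables.txt data.
    import unicodedata

    CONFUSABLES = {
        "\u0430": "a",  # CYRILLIC SMALL LETTER A folds to Latin a
        "\u043e": "o",  # CYRILLIC SMALL LETTER O folds to Latin o
        "\u0440": "p",  # CYRILLIC SMALL LETTER ER folds to Latin p
        "1": "l",       # digit one folds to letter l
        "I": "l",       # capital I folds to letter l as well
        "0": "o",       # digit zero folds to letter o
    }

    def skeleton(s):
        """Normalize, then fold each character to its confusable class."""
        s = unicodedata.normalize("NFD", s)
        return "".join(CONFUSABLES.get(ch, ch) for ch in s)

    def looks_confusable(a, b):
        """Two distinct identifiers are suspicious if skeletons match."""
        return a != b and skeleton(a) == skeleton(b)

    # "paypal" spelled with Cyrillic а and р is flagged:
    print(looks_confusable("paypal", "p\u0430y\u0440al"))  # True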


I only see one plausible solution for multilingual text-form identifiers,
although it's definitely not an easy one (wall of text coming).

It is to actually perform large-scale testing with real people, in multiple
languages and cultures, and then feed the collected data to algorithms to
figure out which patterns (symbols and sets of symbols) people are
*actually* at risk of confusing.

We cannot rely exclusively on heuristics without any data from humans.

Finding collision candidates (potential homoglyphs) could be done with
fuzzy visual comparison algorithms and heuristics, to find pretty much *all
plausible* pairs (and sets) of confusable symbols. This is not for
determining what's actually confusable, but for finding candidates.
Multiple methods would need to be used. We don't care if this stage
suggests obviously distinct pairs; those will be filtered out later. We'd
rather have many false positives than many false negatives.
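
As a rough illustration of one such candidate-finding heuristic, the sketch
below rasterizes symbols and scores pairs by how many pixels differ. Pillow
is assumed for rendering, and FONT_PATH is a placeholder; in practice you
would sweep many fonts and sizes.

    # Candidate finding by pixel comparison of rendered glyphs (sketch).
    from PIL import Image, ImageDraw, ImageFont

    FONT_PATH = "DejaVuSans.ttf"  # placeholder; repeat per tested font
    SIZE = 32

    def render(ch):
        """Rasterize a single symbol as a grayscale image."""
        font = ImageFont.truetype(FONT_PATH, SIZE)
        img = Image.new("L", (SIZE * 2, SIZE * 2), color=255)
        ImageDraw.Draw(img).text((8, 8), ch, font=font, fill=0)
        return img

    def pixel_distance(a, b):
        """Fraction of pixels that differ between two rendered glyphs."""
        pa, pb = render(a).getdata(), render(b).getdata()
        diff = sum((x < 128) != (y < 128) for x, y in zip(pa, pb))
        return diff / len(pa)

    # Keep every pair under a generous threshold as a *candidate*;
    # false positives are fine, humans filter them out later.
    pairs = [("l", "1"), ("O", "0"), ("a", "\u0430"), ("x", "+")]
    candidates = [p for p in pairs if pixel_distance(*p) < 0.05]
    print(candidates)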

To perform the test, you could, for example, first come up with many
different kinds of sample texts and other uses of symbols meant to convey
meaning (in multiple fonts!): not only standard plain text, but also all
forms of lists, instructions, and more.
Then you would randomly replace symbols in the sample texts with similar
ones.
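
The substitution step itself is simple once you have candidate sets. A
minimal sketch, with an illustrative candidate map:

    # Randomly swap symbols in sample texts for confusable candidates.
    import random

    CANDIDATES = {  # hypothetical output of the candidate-finding stage
        "l": ["1", "I"],
        "a": ["\u0430"],       # Cyrillic a
        "o": ["0", "\u043e"],  # digit zero, Cyrillic o
    }

    def perturb(text, rate=0.15, seed=None):
        """Replace each substitutable symbol with probability `rate`."""
        rng = random.Random(seed)
        out = []
        for ch in text:
            if ch in CANDIDATES and rng.random() < rate:
                out.append(rng.choice(CANDIDATES[ch]))
            else:
                out.append(ch)
        return "".join(out)

    original = "follow the protocol and log all anomalies"
    print(perturb(original, seed=42))  # one modified sample for the study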

Then you ask people to try to read these modified texts, as well as the
originals. You would ask them to identify symbols, to say which symbols
and/or sentences are distinct or identical, and to run whatever other tests
may be necessary.

When you have all that real-world data on how real people actually read all
the symbols, context included, you would probably need machine learning
algorithms to process it all (simple statistics are likely to miss a lot of
detail).
Then, finally, you could use the results to produce guidelines for
real-world usage, producing a model of how real people parse visual symbols.
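
For a sense of the baseline that such models would have to beat: a per-pair
confusion-rate table, computed from (shown, reported) trial records. The
record format is hypothetical, and this is exactly the kind of "simple
statistics" that throws away all context.

    # Aggregate study data into per-pair confusion rates (baseline only).
    from collections import defaultdict

    trials = [  # (shown_symbol, reported_symbol), one entry per trial
        ("l", "1"), ("l", "l"), ("1", "l"), ("0", "O"),
        ("a", "\u0430"), ("\u0430", "a"), ("x", "x"),
    ]

    shown = defaultdict(int)
    confused = defaultdict(int)
    for truth, answer in trials:
        shown[truth] += 1
        if answer != truth:
            # Unordered pair key: treat confusion as symmetric here.
            confused[frozenset((truth, answer))] += 1

    for pair, n in confused.items():
        total = sum(shown[s] for s in pair)
        print(sorted(pair), "confusion rate:", round(n / total, 2))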

After all that, if we want a globally usable set of visually distinct
symbols complete enough to write most languages, then we could try to
create a sufficiently complete list of non-ambiguous symbols - most likely
by "dumbing down" pairs of similar symbols into single "universal" symbols,
or "canonical homoglyphs" (?), with less detail but which, depending on
context, are still easily recognizable as the intended writing symbols.
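
The bookkeeping for such "canonical homoglyphs" could be as simple as a
union-find over the pairs that survive the human-testing filter: each class
of mutually confusable symbols elects one representative that all members
get folded to. A sketch with illustrative pairs:

    # Union-find over confusable pairs; find() yields the class
    # representative, i.e. the "canonical homoglyph".
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in [("l", "1"), ("1", "I"), ("O", "0"), ("a", "\u0430")]:
        union(a, b)  # pairs confirmed confusable by the study

    def canonical(ch):
        """Fold a symbol to its class representative."""
        return find(ch)

    print(canonical("I") == canonical("l"))  # True: same class as '1'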

However, I can imagine that a lot of people using non-Latin-script
languages wouldn't be too happy with that solution if it were the only one
(a large number of symbols would be "inaccurate" to various degrees).
Another solution, IIRC already used for domain names, is to create such
lists per individual script and forbid mixing different scripts. That could
obviously still be problematic in some cases, because of the infinite
number of possible multilingual texts and much more.

However, for anything meant to be an identifier in text form, a
script-mixing ban SHOULD be enforced.
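
Enforcing that ban is easy to sketch. Python's standard library does not
expose the Unicode Script property directly, so the sketch below
approximates a character's script from the first word of its Unicode name -
a crude stand-in for a real lookup via ICU or Scripts.txt.

    # Approximate script-mixing check via character names (crude).
    import unicodedata

    NEUTRAL = {"DIGIT", "FULL", "HYPHEN", "LOW"}  # rough "Common" bucket

    def scripts_of(identifier):
        scripts = set()
        for ch in identifier:
            word = unicodedata.name(ch, "UNKNOWN").split()[0]
            if word not in NEUTRAL:
                scripts.add(word)
        return scripts

    def is_single_script(identifier):
        return len(scripts_of(identifier)) <= 1

    print(is_single_script("paypal"))       # True: Latin only
    print(is_single_script("p\u0430ypal"))  # False: Latin + Cyrillic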

Alternatively, limit the number of combined scripts to fixed sets depending
on context, and then ALSO apply the above solution of filtering homoglyphs
and replacing them with "canonical homoglyphs". Reducing the number of
simultaneous scripts limits the number of visual collisions.
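
In the spirit of the "restriction levels" in UTS #39 - where, for example,
Han legitimately combines with Hiragana and Katakana - the fixed-set
variant could look like the sketch below. The allowed combinations are
purely illustrative, and which combinations are reasonable is exactly the
kind of question the testing above should answer.

    # Allow-listed script combinations (sketch; same crude name-based
    # script approximation as in the previous sketch).
    import unicodedata

    ALLOWED_SETS = [
        {"LATIN"},
        {"CYRILLIC"},
        {"GREEK"},
        {"CJK", "HIRAGANA", "KATAKANA"},  # Japanese: kanji plus kana
        {"HANGUL", "CJK"},                # Korean with hanja
    ]

    def scripts_of(identifier):
        return {unicodedata.name(ch, "UNKNOWN").split()[0]
                for ch in identifier}

    def allowed(identifier):
        return any(scripts_of(identifier) <= s for s in ALLOWED_SETS)

    print(allowed("\u30c6\u30b9\u30c8\u6f22\u5b57"))  # True: kana + CJK
    print(allowed("p\u0430ypal"))                     # False: mixed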