[Cryptography] canonicalizing unicode strings.

Nico Williams nico at cryptonector.com
Thu Feb 1 12:08:41 EST 2018


On Thu, Feb 01, 2018 at 03:36:44AM +0100, Natanael wrote:
> On 31 Jan 2018 03:58, "Nico Williams" <nico at cryptonector.com> wrote:
> 
> > We had homoglyph problems before computers ('l' vs '1', for example).
> > They got worse not because of Unicode, but because the world is more
> > connected now.  Yes, we could, e.g., have had a unified CJK codepoint
> > assignment set, but it turns out people didn't want that.
> > 
> > Basically, we just have to accept these issues and deal with them as
> > best we can: with code to heuristically detect phishing based on
> > homoglyphs, and code to fuzzily match Unicode identifiers.  Such code
> > has to evolve as scripts are added to Unicode and/or new homoglyph sets
> > are discovered (if we don't already know all of them).
> 
> I only see one plausible solution for multilingual text form identifiers,
> although definitely not an easy one (wall of text coming).
> 
> It is to actually perform large scale testing with real people, in multiple
> languages and cultures, and then feed that collected data to algorithms to
> figure out which patterns (symbols and sets of symbols) people are
> *actually* at risk of confusing.
> 
> We can not rely exclusively on heuristics without any data from humans.

That's still heuristics.

Perhaps the best heuristic would be an AI image recognizer trained to
spot homoglyph text.  For phishing we need a false-negative rate close
to 0%; otherwise we end up with lots of victims no matter what.

The point I was making is that mixed scripts are fine for identifiers,
but that one must disallow subsequent new identifiers that look the same
as existing identifiers (unless they are aliases of those).
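
A minimal sketch of that rule in Python, under stated assumptions: the
confusables table is a tiny hand-written stand-in (Unicode UTS #39
publishes a far larger one), and the names `skeleton` and `Registry`
are hypothetical, chosen only for this illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny confusables table.  A real deployment
# would load the much larger data set published with Unicode UTS #39.
CONFUSABLES = {
    "rn": "m",       # 'rn' reads as 'm' in many fonts
    "1": "l",        # digit one vs lowercase L
    "0": "o",        # digit zero vs letter o
    "\u0430": "a",   # CYRILLIC SMALL LETTER A
    "\u0435": "e",   # CYRILLIC SMALL LETTER IE
    "\u043e": "o",   # CYRILLIC SMALL LETTER O
}


def skeleton(identifier: str) -> str:
    """Map an identifier to a key shared by its known look-alikes."""
    text = unicodedata.normalize("NFKC", identifier).casefold()
    # Replace longer sequences first so 'rn' is handled before single characters.
    for source in sorted(CONFUSABLES, key=len, reverse=True):
        text = text.replace(source, CONFUSABLES[source])
    return text


class Registry:
    """Mixed scripts are allowed; look-alikes must belong to the same entity."""

    def __init__(self) -> None:
        self._owner = {}  # skeleton -> entity that owns every look-alike

    def register(self, identifier: str, entity: str) -> None:
        key = skeleton(identifier)
        owner = self._owner.setdefault(key, entity)
        if owner != entity:
            raise ValueError(
                f"{identifier!r} looks like an identifier owned by {owner!r}"
            )


registry = Registry()
registry.register("dam", "entity-1")
registry.register("darn", "entity-1")      # accepted: alias of the same entity
try:
    registry.register("darn", "entity-2")  # refused: look-alike, other entity
except ValueError as err:
    print(err)
```

The table-driven skeleton only catches look-alikes someone has already
listed, so it would have to grow as new homoglyph sets are found.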

Heuristics and/or occasional human intervention are needed because some
homoglyphs may need to be tolerated.  E.g., darn vs dam -- OK or not OK
as identifiers for different entities in the same namespace?
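
One way to model that escape hatch, as a sketch only: look-alike
collisions go to a human unless the specific pair has already been
approved.  The function `classify`, the toy `toy_key`, and the
`tolerated` set are all hypothetical names for this illustration.

```python
def toy_key(identifier: str) -> str:
    """Stand-in look-alike key: only knows that 'rn' resembles 'm'."""
    return identifier.casefold().replace("rn", "m")


def classify(new_id, existing, key, tolerated):
    """Return 'accept' or 'review' for a proposed identifier.

    `tolerated` holds frozenset pairs that a human has already approved
    as acceptable for different entities in the same namespace.
    """
    for old_id in existing:
        if key(old_id) != key(new_id):
            continue                      # not a look-alike of this entry
        if frozenset((old_id, new_id)) in tolerated:
            continue                      # pair was explicitly tolerated
        return "review"                   # needs heuristics or a human
    return "accept"


existing = ["dam"]
print(classify("darn", existing, toy_key, set()))
print(classify("darn", existing, toy_key, {frozenset(("darn", "dam"))}))
```

Keeping the tolerated pairs as data rather than code lets the policy
change without touching the matching logic.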

Nico
-- 

