[Cryptography] canonicalizing unicode strings.

Ray Dillinger bear at sonic.net
Sun Jan 14 23:46:34 EST 2018



On 01/14/2018 03:19 AM, jamesd at echeque.com wrote:
> I would like strings that look similar to humans to map to the same
> item. Obviously trailing and leading whitespace needs to go, and
> whitespace map to a single space.
> 
> The hard part, however, is that unicode has an enormous number of near
> duplicate symbols.
> 
> Is there somewhere a list of near duplicate unicode symbols, or existing
> canonicalization code?



Yes, there is.  This file summarizes the known Unicode homoglyphs and
near-homoglyphs:

http://www.unicode.org/Public/security/revision-03/confusablesSummary.txt

Here is a utility for generating lookalike strings, based on the
information in that file.

http://unicode.org/cldr/utility/confusables.jsp

It's pretty much a given that a string should never be allowed as a
URL, identifier, username, title, product name, certificate identifier,
etc., if it mixes alphabets AND contains a character from one of those
alphabets that is a homoglyph for a character in another alphabet used
in the same string.
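
A rough Python sketch of that rule, for illustration only.  The
CONFUSABLE table here is a tiny hand-picked subset that I am assuming
for the example; a real check would load the full data from the
Unicode confusables files.  The script_of() helper is also an
assumption: the standard library has no script property, so it guesses
the script from the first word of the character name.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table: character -> the
# character in another script that it can be mistaken for.
CONFUSABLE = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def script_of(ch):
    """Heuristic script guess: first word of the Unicode name, letters only."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def is_suspicious(s):
    """True if s mixes scripts AND one of its characters is a lookalike
    for a character in another script that also appears in s."""
    scripts = {script_of(ch) for ch in s} - {None}
    if len(scripts) < 2:
        return False
    for ch in s:
        twin = CONFUSABLE.get(ch)
        if twin is None:
            continue
        twin_script = script_of(twin)
        if twin_script in scripts and twin_script != script_of(ch):
            return True
    return False
```

With this table, is_suspicious("p\u0430ypal") (Cyrillic "a" inside a
Latin word) is flagged, while an all-Latin or all-Cyrillic string is
not.  Note that the table only maps in one direction, so it is a
starting point rather than a complete check.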

It's also pretty much a given that VERY few CAs, chat boards, social
media platforms, e-commerce sites, etc., check for and enforce any such
rule.

The results are predictable.  Lots of fraud and impersonation happens
with homoglyph attacks.

Unicode homoglyphs are a security nightmare.  Search for "homoglyph
attack generator" in any search engine and you will find pen-testing
tools and script-kiddie hacks.
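
As for the easy half of the original question (stripping leading and
trailing whitespace and collapsing internal runs to a single space),
a minimal Python sketch follows.  The function name is made up for the
example, and it assumes Python's own definition of whitespace is the
one you want.

```python
def canonical_whitespace(s):
    # str.split() with no arguments splits on runs of Unicode whitespace
    # and drops leading and trailing runs; join the pieces back together
    # with single spaces.
    return " ".join(s.split())
```

Zero-width characters such as U+200B are not whitespace by that
definition, so they would have to be handled separately.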

				Bear



