[Cryptography] canonicalizing unicode strings.

Peter Todd pete at petertodd.org
Mon Jan 15 11:12:08 EST 2018


On Sun, Jan 14, 2018 at 08:46:34PM -0800, Ray Dillinger wrote:
> 
> 
> On 01/14/2018 03:19 AM, jamesd at echeque.com wrote:
> > I would like strings that look similar to humans to map to the same
> > item. Obviously trailing and leading whitespace needs to go, and
> > whitespace should map to a single space.
> > 
> > The hard part, however, is that Unicode has an enormous number of
> > near-duplicate symbols.
> > 
> > Is there somewhere a list of near-duplicate Unicode symbols, or existing
> > canonicalization code?
> 
> 
> 
> Yes, there is.  This file summarizes known Unicode homoglyphs and
> near-homoglyphs.
> 
> http://www.unicode.org/Public/security/revision-03/confusablesSummary.txt

If possible, I always recommend a whitelist over the blacklist approach
shown above. A blacklist will inevitably go out of date as new homoglyphs
and near-homoglyphs are added to Unicode, whereas a whitelist rejects any
character you have not explicitly approved.
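
To make that concrete, here is a minimal sketch in Python. It is only an
illustration: the ALLOWED set and the canonicalize() name are hypothetical,
and the choice of NFKC normalization before the check is my assumption, not
something the confusables file mandates. Pick the allowed set to suit your
own application.

```python
import string
import unicodedata

# Hypothetical whitelist: ASCII letters, digits, space, and a little
# punctuation. Anything outside this set is rejected outright.
ALLOWED = frozenset(string.ascii_letters + string.digits + " -_.")

def canonicalize(s: str) -> str:
    """Return a canonical form of s, or raise ValueError."""
    # Assumption: fold compatibility forms (e.g. fullwidth letters) first.
    s = unicodedata.normalize("NFKC", s)
    # split() with no arguments splits on runs of Unicode whitespace and
    # drops leading/trailing whitespace; rejoin with single spaces.
    s = " ".join(s.split())
    # Whitelist check: refuse anything not explicitly approved.
    bad = sorted({c for c in s if c not in ALLOWED})
    if bad:
        names = ", ".join(f"U+{ord(c):04X}" for c in bad)
        raise ValueError(f"disallowed characters: {names}")
    return s
```

With this sketch, "  Hello \t World " becomes "Hello World", while a string
containing Cyrillic U+0430 in place of Latin "a" is rejected rather than
silently accepted. Note that a whitelist does not remove look-alikes within
the allowed set itself (such as "l", "I" and "1").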

-- 
https://petertodd.org 'peter'[:-1]@petertodd.org