[Cryptography] canonicalizing unicode strings.

Howard Chu hyc at symas.com
Mon Jan 15 00:04:20 EST 2018


jamesd at echeque.com wrote:
> I would like strings that look similar to humans to map to the same item. 
> Obviously trailing and leading whitespace needs to go, and runs of whitespace 
> should map to a single space.
> 
> The hard part, however, is that Unicode has an enormous number of 
> near-duplicate symbols.
> 
> Is there a list of near-duplicate Unicode symbols somewhere, or existing 
> canonicalization code?

Have you already read UAX #15, "Unicode Normalization Forms"
( https://www.unicode.org/reports/tr15/tr15-45.html )?

Our normalization code is in 
http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=tree;f=libraries/liblunicode;h=4896a6dc9ee5d3e78c15ed6c2e2ed2f21be70247;hb=HEAD
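
For a quick feel of what the normalization forms in that report do, here is a
minimal sketch in Python using the standard library's unicodedata module. It
is not the liblunicode code, and the function name canonicalize is made up for
illustration. It assumes NFKC (compatibility decomposition followed by
canonical composition) is the form you want, and it adds the whitespace
handling you described.

```python
import unicodedata


def canonicalize(s: str) -> str:
    """Hypothetical helper: NFKC-normalize, then trim and collapse whitespace."""
    # NFKC folds compatibility variants (fullwidth letters, ligatures,
    # no-break spaces) and composes base letter + combining mark pairs.
    s = unicodedata.normalize("NFKC", s)
    # split() with no argument splits on runs of whitespace and drops
    # leading/trailing whitespace; join the pieces with single spaces.
    return " ".join(s.split())


if __name__ == "__main__":
    # Fullwidth "H" (U+FF28) and mixed whitespace.
    print(canonicalize("  \uff28ello \t\n world "))
    # "e" + combining acute (U+0301) composes to U+00E9.
    print(canonicalize("e\u0301") == "\u00e9")
```

Note that normalization alone does not merge look-alikes from different
scripts: Cyrillic U+0430 and Latin "a" stay distinct after NFKC, so matching
strings that merely look the same to a human needs something beyond UAX #15.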

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

