[Cryptography] canonicalizing unicode strings.

Tue Feb 6 04:46:27 EST 2018

On 15/01/2018 13:04, Howard Chu wrote:
> jamesd at echeque.com wrote:
>> I would like strings that look similar to humans to map to the same
>> item. Obviously trailing and leading whitespace needs to go, and
>> runs of whitespace need to map to a single space.
>>
>> The hard part, however, is that unicode has an enormous number of near
>> duplicate symbols.
>>
>> Is there somewhere a list of near duplicate unicode symbols, or 
>> existing canonicalization code?
> 
> Have you already read https://www.unicode.org/reports/tr15/tr15-45.html ?

This link is extremely useful, but does not address the homoglyph problem.

It ensures that unicode strings that are logically equivalent, that is,
intended to represent the same sequence of characters, are represented
by the same sequence of code points, and hence, under a given encoding,
by the same sequence of bits.
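
As a rough illustration of what that normalization buys you, here is a
minimal Python sketch using the standard unicodedata module. The
function name canonicalize and the choice of NFKC as the default form
are my own assumptions for the example, not something the report
mandates.

```python
import unicodedata

# Two encodings of the same logical character:
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The raw strings differ, but agree after normalization.
raw_equal = composed == decomposed
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))


def canonicalize(s: str, form: str = "NFKC") -> str:
    """Hypothetical helper: normalize, trim, and collapse whitespace.

    `form` is any form accepted by unicodedata.normalize
    ("NFC", "NFD", "NFKC", "NFKD"). NFKC is assumed here because it
    also folds compatibility variants such as NO-BREAK SPACE.
    """
    s = unicodedata.normalize(form, s)
    # str.split() with no arguments splits on any run of Unicode
    # whitespace and drops leading and trailing whitespace.
    return " ".join(s.split())
```

With this, canonicalize("  cafe\u0301 \t x ") and canonicalize("caf\u00e9 x")
yield the same string, which is the property the report is about.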

It does not address the problem of unicode strings that are logically
distinct but look similar, for example:

1 🠚 l
- 🠚 –
− 🠚 –
0 🠚 O
ο 🠚 o
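
To make the gap concrete, here is a small Python sketch. It checks that
normalization leaves such pairs distinct, then folds them with a tiny
hand-written table. The table FOLD and the function fold_lookalikes are
hypothetical and cover only the five pairs listed above. A usable table
would have to be far larger, and I do not claim this is how any
existing library does it.

```python
import unicodedata

# Normalization alone does not merge look-alikes:
omicron_survives = unicodedata.normalize("NFKC", "\u03bf") != "o"
minus_survives = unicodedata.normalize("NFKC", "\u2212") != "\u2013"

# Hypothetical fold table, limited to the pairs listed above.
FOLD = str.maketrans({
    "1": "l",            # DIGIT ONE -> LATIN SMALL LETTER L
    "-": "\u2013",       # HYPHEN-MINUS -> EN DASH
    "\u2212": "\u2013",  # MINUS SIGN -> EN DASH
    "0": "O",            # DIGIT ZERO -> LATIN CAPITAL LETTER O
    "\u03bf": "o",       # GREEK SMALL LETTER OMICRON -> LATIN SMALL LETTER O
})


def fold_lookalikes(s: str) -> str:
    """Normalize to NFKC, then map each look-alike to one representative."""
    return unicodedata.normalize("NFKC", s).translate(FOLD)
```

Under this table, fold_lookalikes("\u03bfne-1") and
fold_lookalikes("one\u22121") compare equal, although the inputs differ
both before and after normalization. Note that folding is lossy: "1"
and "l" become indistinguishable, so the result is only suitable as a
lookup key, not for display.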