[Cryptography] canonicalizing unicode strings.
jamesd at echeque.com
Tue Feb 6 04:46:27 EST 2018
On 15/01/2018 13:04, Howard Chu wrote:
> jamesd at echeque.com wrote:
>> I would like strings that look similar to humans to map to the same
>> item. Obviously trailing and leading whitespace needs to go, and
>> runs of whitespace need to map to a single space.
>>
>> The hard part, however, is that unicode has an enormous number of near
>> duplicate symbols.
>>
>> Is there somewhere a list of near duplicate unicode symbols, or
>> existing canonicalization code?
>
> Have you already read https://www.unicode.org/reports/tr15/tr15-45.html ?
This link is extremely useful, but does not address the homoglyph problem.
It ensures that unicode strings that are logically equivalent, intended
to represent the same sequence of characters, are represented by the
same sequence of bits.
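As a rough illustration of what that report buys you, here is a minimal Python sketch using the standard library's unicodedata.normalize. The function name canonicalize and the choice of NFC as the default form are my own assumptions, not anything the report mandates; passing "NFKC" instead would additionally fold compatibility variants such as ligatures and fullwidth letters. The whitespace handling covers the trimming and collapsing mentioned in the original question.

```python
import unicodedata


def canonicalize(s: str, form: str = "NFC") -> str:
    """Normalize a string and tidy its whitespace.

    Hypothetical helper for illustration only; `form` may be any of
    "NFC", "NFD", "NFKC" or "NFKD" as accepted by unicodedata.normalize.
    """
    # Logically equivalent sequences (e.g. "e" + combining acute accent
    # versus precomposed "é") become the same code point sequence.
    s = unicodedata.normalize(form, s)
    # str.split() with no argument splits on runs of Unicode whitespace
    # and discards leading and trailing whitespace, so joining with a
    # single space trims the string and collapses the interior runs.
    return " ".join(s.split())


if __name__ == "__main__":
    decomposed = "  cafe\u0301   au\tlait "
    precomposed = "caf\u00e9 au lait"
    print(canonicalize(decomposed) == precomposed)
```

Two inputs that canonicalize to the same string can then be compared, hashed or signed as the same item.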
Normalization does not address the problem of unicode strings that are logically
inequivalent, but which look similar, for example:
1 🠚 l
- 🠚 –
− 🠚 –
0 🠚 O
ο 🠚 o
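One way to attack this is to fold each character to a chosen representative before comparing, so that look-alike strings collapse to the same "skeleton". The sketch below is only that: the table CONFUSABLES and the functions skeleton and look_alike are hypothetical names, and the table contains nothing beyond the five pairs listed above. A usable table would need to be far larger; the Unicode consortium publishes confusables data as part of UTS #39 (Unicode Security Mechanisms), which would be the obvious place to start looking.

```python
import unicodedata

# Hypothetical, deliberately tiny table built only from the pairs
# listed above. Each key is folded to the representative on the right.
CONFUSABLES = {
    "1": "l",            # DIGIT ONE -> LATIN SMALL LETTER L
    "-": "\u2013",       # HYPHEN-MINUS -> EN DASH
    "\u2212": "\u2013",  # MINUS SIGN -> EN DASH
    "0": "O",            # DIGIT ZERO -> LATIN CAPITAL LETTER O
    "\u03bf": "o",       # GREEK SMALL LETTER OMICRON -> LATIN SMALL LETTER O
}


def skeleton(s: str) -> str:
    """Normalize, then fold every character through the table."""
    s = unicodedata.normalize("NFC", s)
    return "".join(CONFUSABLES.get(ch, ch) for ch in s)


def look_alike(a: str, b: str) -> bool:
    """True if the two strings fold to the same skeleton."""
    return skeleton(a) == skeleton(b)


if __name__ == "__main__":
    print(look_alike("paypa1", "paypal"))
    print(look_alike("c\u03bfde", "code"))
    print(look_alike("a-b", "a\u2212b"))
```

Note that the skeleton is only good for comparison. It is not something to show to the user, since folding turns "10" into "lO".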