"Approximate" hashes

Thu Sep 2 04:50:44 EDT 2004

> > On Behalf Of Marcel Popescu
>...
> > My problem is that I don't know what happens with the email in transit
> > (this, I believe, is an observation in the hashcash FAQ). I
> > am worried that some mail server might dislike ASCII characters with ....
> >
> > Hence my question: is there some "approximate" hash function
> > (which I could use instead of SHA-1) which can verify that a
> > text hashes "very close" to a value?

At 12:27 PM 9/1/2004, Keith Ray wrote:
>nilsimsa
>Computes nilsimsa codes of messages and compares the codes and finds
>clusters of similar messages so as to trash spam.

Check out Vipul's Razor, which uses an approach similar to this.
You'll find information at Cloudmark and on Sourceforge.

There are several different kinds of differences to work around -
- damage in transit, as noted, though it's the least of your worries
         in spite of Unicode, MS Codesets, and 8-bit-uncleanness
- different mail headers getting added or subtracted or mimed
         (Some people include relevant parts in their message indexes,
         some don't.)
- deliberate differences introduced in the message to discourage detection,
         ranging from the simple "Dear Alice"/"Dear Bob" to
         removal addresses that encode each spam victim's info,
         to different random word-scrambles that are also there to
         discourage Bayesian spam-detectors.
         This one's really common these days, especially as mail systems
         have decreased the number of users they'll send to
         in a given SMTP session / envelope because of spamming -
         if you can only spam 5-10 recipients per TCP session,
         might as well make each session somewhat different
         so you only get hit by local detectors, not global indexers.
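
To make the "approximate hash" idea concrete, here is a minimal sketch of a
similarity-preserving hash that tolerates small differences of the kinds
listed above. It is a SimHash-style construction over character trigrams,
not the nilsimsa algorithm and not what Vipul's Razor actually computes.
The names normalize, simhash and hamming, the 64-bit width and the trigram
size are all assumptions chosen for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed; limited by the 8 digest bytes used below)


def normalize(text):
    """Lowercase and collapse whitespace runs so trivial reflowing does not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def simhash(text, n=3):
    """Return a 64-bit fingerprint; similar texts tend to differ in few bits."""
    text = normalize(text)
    # Set of distinct character n-grams (a single short gram if the text is tiny).
    grams = {text[i:i + n] for i in range(max(1, len(text) - n + 1))}
    counts = [0] * BITS
    for gram in grams:
        digest = hashlib.sha1(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each n-gram votes +1 or -1 on every bit position.
        for bit in range(BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set when the positive votes win.
    return sum(1 << bit for bit in range(BITS) if counts[bit] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two messages would then be treated as "very close" when
hamming(simhash(a), simhash(b)) falls below some threshold that has to be
tuned on real mail. Note that this gives up the collision resistance of
SHA-1: anyone can construct a different text with a nearby fingerprint, so
it is only useful for clustering, not for authentication or hashcash-style
proofs.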

Vipul's Razor and related approaches try to calculate a unique id
for each message so that if a human detects that a message is spam,
the id can be published so everybody else trashes it.
This usually needs more than one human rating something as spam
to prevent abuse, and there's some tuning, but it's a good start.
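
The publish-and-threshold idea can be sketched roughly as follows. This is
only an illustration under stated assumptions, not Razor's actual protocol
or signature scheme: the class SpamRegistry, its methods, and the default
of two distinct reporters are hypothetical, and the message id here is a
plain SHA-1 of the whitespace-normalized, lowercased body rather than a
fuzzy signature.

```python
import hashlib
from collections import defaultdict


class SpamRegistry:
    """Toy shared index: a message id counts as spam once enough distinct
    reporters have flagged it (hypothetical design, for illustration only)."""

    def __init__(self, min_reports=2):
        self.min_reports = min_reports
        self._reporters = defaultdict(set)  # message id -> set of reporter names

    @staticmethod
    def message_id(body):
        # Canonicalize lightly so case and whitespace changes do not alter the id.
        canonical = " ".join(body.lower().split())
        return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

    def report(self, reporter, body):
        # A set ignores repeat reports from the same reporter.
        self._reporters[self.message_id(body)].add(reporter)

    def is_spam(self, body):
        seen = self._reporters.get(self.message_id(body), ())
        return len(seen) >= self.min_reports
```

Requiring several distinct reporters is what keeps a single user from
blacklisting legitimate mail. An exact hash like this one is defeated by
the per-recipient variations described earlier, which is why real systems
use fuzzier signatures in its place.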

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com