Fingerprinting Your Files

Sat Aug 7 15:17:19 EDT 2004

<http://www.technologyreview.com/articles/04/08/wo_garfinkel080404.asp?p=0>

Technology Review

Fingerprinting Your Files
 "Hash" functions identify digital content with mathematical certainty-but
is that enough to foil the hackers?

By Simson Garfinkel
 The Net Effect
8/04/2004

Three cryptographers at Stanford University recently came up with a clever
solution to the persistent problem of identity theft on the Internet. Wily
hackers in Russia, China, and other countries send out piles of e-mail
messages looking like they came from some financial institution such as
Citibank or Paypal. Millions of consumers get these messages, which have
embedded HTML links in them that take the unsuspecting recipient to
look-alike websites run in faraway places. You're prompted to enter a
username and password and then-wham-the hacker has the keys to your bank
account.

But good usernames and passwords typed at bad websites isn't the only such
threat that consumers face. A potentially larger problem is that many
people use the same username and password combination at multiple sites.
This makes memorization easier, but it means that an unscrupulous website
operator can take a list of usernames and passwords from, say, an Internet
sweepstakes site and use it to try to break into online bank accounts.

So Stanford cryptographers Blake Ross, Dan Boneh, and John Mitchell have
designed a clever plug-in for Internet Explorer that solves this problem by
scrambling what you type into the password field so every website sees a
different password-a password that's based both on what you type and on the
domain of the website itself.

Now, lots of people use some variant on this strategy. Their Hotmail
password might be "nosmis-hotmail" while their Yahoo! Personals password
is "nosmis-Yahoo!" But any strategy like this is pretty simple to decipher.
The password scrambling method that the Stanford trio has devised is based
on a mathematical function called a cryptographic hash-a kind of one-way
function that transforms what the user types into a jumble of numbers and
letters in a way that cannot be reversed. Because the Stanford system
calculates the cryptographic hash of both the website's domain and the
user's password, the hacker gets different passwords than the legitimate
ones. (Click here to find details about this clever solution.)

One company that's using cryptographic hashes in a very public way is
Yahoo! Last year, Yahoo! redesigned the login process to its website to
make it sniff-proof. The standard way to do this is to use encryption. But
encryption can be slow-especially when you are running one of the most
popular sites on the Internet.

So what Yahoo! did instead was to modify its login page to use a so-called
challenge-response system based on a cryptographic hash. When you try to
log in, Yahoo!'s server downloads to your browser a cryptographic hash
function written in JavaScript. Along with this function is a "challenge"-a
short sequence of letters and numbers. When you type your password into the
login screen, your browser takes your password, appends these characters
provided by Yahoo!, and calculates the cryptographic hash of the resulting
string. The browser then sends the resulting value back to Yahoo!, no
encryption needed. Even if you are at a cybercafe having your Web traffic
sniffed by Belgium hackers, there's no way for the bad guys to take the
resulting hash value and derive your original password.

This clever "challenge-response" system is also at the base of the Mobil
Speedpass system: it's what makes the Speedpass radio frequency
identification (RFID) tag so difficult to clone. Other RFID systems don't
use challenge-response, which makes attacking them comparatively easy.

But what is this cryptographic hash function, anyway?

The Incredibly Useful Hash

Cryptographic hash functions are one of the fundamental building blocks of
today's digital economy. Nevertheless they remain in many ways a
mystery-both to the cryptographers who create them, and to the general
public that uses them every day.

Hash functions are sometimes called "fingerprint functions" because they
can be used to create a unique "fingerprint" of a digital file. The
fingerprints are usually 128-bit or 160-bit numbers that are displayed as a
sequence of hexadecimal digits. The fingerprint of my name using the MD5
system, for example, is c55bbe0f3ba258f5b1cb6d5b62b0b360. Hash functions
are designed so that, in theory at least, no two files will ever hash to
the same value.

So that you can get an idea of how these fingerprinting functions work,
we've embedded a JavaScript-based MD5 calculator below. Just type in some
text and you can see the MD5 hash. Notice how it completely changes every
time you add, remove, or change a letter. The manner that the fingerprint
changes is unpredictable-indeed, if we could predict how it changes, then
file fingerprints wouldn't be very useful.

Enter your text below:

The MD5 is:

Most of the hash functions used today are based on a technique developed by
MIT professor Ron Rivest in the 1980s. (Rivest is probably best known for
being the "R" in the RSA encryption algorithm, the public key encryption
algorithm that's built into practically every Web browser.) At the time,
Rivest and other mathematicians were working out the details of the basic
cryptographic operations that we now take for granted. The hash functions
were envisioned as a kind of cryptographic compression system-a way to take
a large file and crunch it down to a short string of letters and numbers.

The idea was to use these fingerprints as a kind of surrogate for the files
themselves. Instead of digitally signing the entire file, Rivest and others
reasoned, you could digitally sign the hash. Because public-key
cryptography involves a lot of heavy-duty math, hash functions make it
almost as fast to sign an extremely long file as to sign a short file.

One of the most basic things that you can do with a hash function is to
find out if a file has changed: just calculate the hash of a file and write
it down. Later on you calculate the hash again. If the hash hasn't changed,
then the odds are overwhelming that the file hasn't changed either.

For example, say that you keep the finances of your small business using
QuickBooks and you want to go on vacation for a few days: people need to
use your computer but you want to make sure that nobody modifies the
QuickBooks data. One simple thing that you can do is compute the
cryptographic hash of the file before you leave and write the number on an
index card. When you come back from vacation, just re-compute the hash. If
the two values don't match, you know that the file has been tampered with.

Of course, you don't need to stop with just one file. You could compute the
cryptographic hash of every file on your computer and put them all into a
new file-call that file hashes.txt. You could then compute the hash of
hashes.txt and write this "fingerprint" on your note card. Repeat the
process when you come back from vacation and you'll have a fast way of
knowing if any file on your entire computer has changed. (You won't have
any way of knowing which file has changed, but that's a different problem.)

This idea of computing the hash of a hash is the basis of an intrusion
detection system called Tripwire that Purdue University computer science
professor Gene Spafford and his graduate student Gene Kim invented back in
the early 1990s. (Spafford and I have co-authored five books on computer
science.) Today, many different programs use this Tripwire approach to
assure the integrity of computer files and databases.

Computing hashes of hashes is also the basis of a secure timestamp service
invented by Stuart Haber and Scott Stornetta while the two were at Bellcore
in 1990. The service, called Surety, makes it possible to generate a
cryptographically secure and unforgeable proof that a given document,
photograph, or other file existed at a particular time on a particular date
and that it hasn't been changed since.

The Surety technique works by computing a "hash tree" based on the hash
codes of every document being time-stamped. The root of the tree is then
published in a well-known location-it could, for example, be printed in a
classified advertisement in the New York Times. You can prove that your
document existed on the day in question by showing that your document's
fingerprint was needed to generate the fingerprint-of-fingerprints that
appeared in the newspaper.

Other companies and even the U.S. Postal Service have since created their
own electronic time-stamp service. But all of these systems rely on an
organization that acts as a "trusted third-party" that in effect signs your
document using their private key. The problem with this approach is that
the third-party needs to be completely trustworthy: if that third-party
decides to create a signature with the wrong date, or some hacker manages
to steal the third-party's private key, there is no way to tell a
fraudulent signature from a valid one. It's also possible to create
fraudulent Surety signatures, of course, but you would need to either go
back in time and change what was printed in The New York Times, or else
travel all over the world, find every copy that was printed, and change the
old fingerprint-of-fingerprints to the new one.

How Hash Functions Work

So that's why hash functions are helpful. Now, let's see what they actually
look like.

Among the most widely used hash functions today are the so-called MD5 (for
Message Digest #5). MD5 produces a hash that is 128 bits long and that is
commonly written as sequence of 32 hexadecimal (base 16) digits. If you
were to take my name and process it with MD5, you would get this seemingly
random string:

c55bbe0f3ba258f5b1cb6d5b62b0b360

Or, to state it with more mathematical formality:

MD5("Simson Garfinkel")= c55bbe0f3ba258f5b1cb6d5b62b0b360

Each of those hexadecimal characters represents 4 bits; the MD5 value of my
name is actually:

1100010101011011101111100000111100111011101
0001001011000111101011011000111001011011011
010101101101100010101100001011001101100000

Most people work with the hexadecimal representation because it's pretty
easy to look two hashes and tell if they are the same or different.

MD5 works by splitting the file up into lots of small pieces, and then
taking each of those chunks and performing hundreds of mathematical
operations that shuffle, invert, transpose, and otherwise process the bits
into an unrecognizable mess. The word "unrecognizable" in this description
is key. The fundamental requirement of a good hash function is that it
should be impossible to predict the fingerprint of a file without actually
going to the effort of computing that fingerprint -there must be no
short-cuts. If there were, you might be able to run the hash function
backwards and create a file that had a specific hash-for example, the hash
of another file. Indeed, the entire security of hash functions falls apart
utterly if it is possible to generate two files that have the same hash.

The beauty of the hash function is that even a tiny modification to the
input produces a dramatic change in the output. Mathematically, the
functions are designed so that every bit in the output will have a 50
percent chance of changing for every single bit changed in the input.

Let's look at another MD5 hash, this one of a slightly different
representation of my name:

MD5("Simson L. Garfinkel")= df876e8e6f548d5be698fab7f06dd278

Merely adding "L." produces a completely different hash. If you compare the
two hashes bit-for-bit you'll find that 63 out of the 128 positions have
changed from a 0-to-1 or a 1-to-0, and the other 65 have remained unchanged.

Unfortunately, the whole theory of cryptographic hash functions has a huge
problem. The use of these functions requires that there be no so-called
"collisions". Either accidentally or on purpose, there should be no two
files that have the same cryptographic fingerprint. And as it turns out,
this is an impossible requirement.

The reason is pretty simple. File fingerprints are a fixed size, which
means that there is a finite number of possible fingerprints. Files, on the
other hand, can be any size. Thus, there are more possible files than
fingerprints, and so there must be at least one fingerprint that is the
fingerprint of multiple files. The mathematical term for this is the
"pigeonhole principle." Indeed, even if you restrict yourself to files that
are just nine characters long, there are still 256 times the number of
possible files as the number of possible fingerprints.

The reason that the pigeonhole principle doesn't render hash functions
completely pointless is that there are an astounding number of possible
fingerprints-far more, in fact, than the number of files on the planet.
(With MD5 there are 2128 possible fingerprints. Now, the total number of
computer hard drives that have ever been manufactured is only around 229.
If every hard drive had a million unique files-a gross overestimation-there
would still be only 249 individual files. That's a much, much, much smaller
number than 2128.)

The SHA-1 Controversy

For tutorial purposes, I have used the MD5 hash function. But these days
MD5 is considered passé-instead most of the world is moving over to the
U.S. government's Secure Hash Algorithm, known as SHA-1, a standard adopted
by the National Institutes of Standards and Technology (NIST) back in the
early 1990s.

Today SHA-1 is a widely respected algorithm, but it has a troubled history.
Back in 1993, the U.S. government was trying to get industry to adopt the
so-called Clipper Chip-a secret encryption system designed by the National
Security Agency. During the so-called "crypto wars" that raged around
Clipper, NIST proposed that the U.S. government adopt its own Secure Hash
Algorithm as part of the Federal Information Processing Standards. For
technical reasons, hash functions should have twice as many bits as the
encryption algorithms that they work with. Clipper was an 80-bit encryption
algorithm, so the standard was designed to produce a 160-bit fingerprint.

One might think that the governments' standard, with its 160-bit
fingerprint, would be more secure than the 128-bit MD5. But like Clipper
itself, SHA was designed by the National Security Agency-and both NIST and
the NSA declined to explain the principles that were used in its design.
Some people wondered if the NSA might have hidden some kind of "back door"
inside the algorithm so that the agency could generate collisions on
demand. Such a back door could be used, for example, to produce faked
digital signatures-something that the Central Intelligence Agency might
find useful. A fake digital signature might be used, for example, to sign
an electronic order giving an U.S. spy access to a database in a foreign
country.

Lots of cryptographers and other academics analyzed the SHA algorithm and
couldn't find anything wrong with it. On May 11, 1993, NIST proclaimed SHA
as the nation's Secure Hash Algorithm. But the ink was barely dry on this
decree when NIST announced that it had made a mistake. For reasons that
would not be revealed at the time, NIST published a modified version of the
Secure Hash Algorithm-the algorithm that we now call SHA-1.

The conspiracy theorists in the cryptography community (and there are many)
had a field day. Was SHA so powerful that the NSA had decided that it had
to be "dumbed down?" Or had NSA perhaps planted a back door in SHA-and
somebody at NIST had found out? Were both algorithms equally secure, and
the cryptographers at the NSA were just messing with people's minds?

In August 1998, the world more-or-less learned the answer to the SHA vs.
SHA-1 mystery. Florent Chabaud and Antoine Joux, two French cryptographers,
came up with a theoretical attack against the first version of SHA-an
attack against which SHA-1 just happened to be secure. Almost certainly,
the folks at NSA knew about this attack and proposed SHA-1 as a
countermeasure. What's interesting here is that NSA's cryptographers
probably didn't know about the attack when SHA was first proposed in
1993-which means that the world's top cryptographic agency was only five
years ahead of the cryptographers in academia.

Today hash functions are also commonly used to generate repeatable but
unpredictable "random numbers," for converting typed passwords into values
suitable for using as encryption keys. Instead of storing passwords
directly, many computer systems store the hash of a password. This prevents
somebody who breaks into a computer from learning everybody's password.

Hash functions have been proposed as a way to fight spam and as the basis
for digital cash systems. Mathematician Peter Wayner published a book
called Translucent Databases a few years ago in which he showed how hash
functions could be used for storing information in a database in a way
that's protected by the organization that's running the database. A college
admissions department, for example, could store student social security
numbers in the database so that these numbers could still be used as
identifiers on applications, but so that nobody in the admissions office
could sit down at a terminal and get a list of students and their numbers.
So far, though, none of those approaches have really gotten off the ground.

All in all, cryptographic hashes are one of the most interesting and useful
mathematical techniques that cryptographers have come up with over the past
20 years-and we're still finding new uses for them all the time.

Simson Garfinkel is the author of nine books on computing, including
Database Nation.
-- 
-----------------
R. A. Hettinga <mailto: rah at ibuc.com>
The Internet Bearer Underwriting Corporation <http://www.ibuc.com/>
44 Farquhar Street, Boston, MA 02131 USA
"... however it may deserve respect for its usefulness and antiquity,
[predicting the end of the world] has not been found agreeable to
experience." -- Edward Gibbon, 'Decline and Fall of the Roman Empire'

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com