Extracting uniform randomness from a noisy source

John Kelsey kelsey.j at ix.netcom.com
Wed Aug 7 09:50:45 EDT 2002


So, the properties we want here are something like

a.  If the input had at least N+epsilon_0 bits of entropy, the N-bit output
is indistinguishable from random, even to a computationally unbounded
opponent.

b.  If the input had at least R+epsilon_1 bits of entropy, the N-bit output
is indistinguishable from random to an attacker who is limited to less than
2^R trial hashes.

Does this look right to you guys?  Are there other subtleties I'm missing?
Are there weaker assumptions that would give us what we need here?  Because
unless I'm missing something big here, (a) is unattainable from existing
hash functions.


Think about the structure of SHA1 (and all other widely used hash
functions):  we're basically doing an encryption operation using a 512-bit
message block as key, and a chaining variable as IV.  Just this much detail
is enough to prove that we can't ever get the first requirement for some
kinds of entropy input.  Specifically, if we're trying to collect 160 bits
of entropy, and our input strings are padded with at least 512 known bits
at the end, then we will get outputs that are distinguishable from random,
at least if we see any reasonable number of them in a row.  This is true
because the feed-forward that makes the compression function one-way also
makes it non-surjective for a fixed message block.

Ignore the internals of the SHA1 compression function, and just think of a
block cipher with a 160-bit block and a 512-bit key.  If the block cipher
is E_K(X) for encrypting block X with key K, then our hash compression
function looks like this:

H(X,M) = E_M(X) + X

where + is word-wise addition (mod 2^{32}, in SHA1's case), or XOR (bitwise
addition), or something similar.
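
To make the structure concrete, here's a minimal Python sketch of that
Davies-Meyer feed-forward.  The encrypt() argument is just a stand-in for
SHA1's internal block cipher; I'm not implementing that here:

    def davies_meyer(encrypt, chaining_value, message_block):
        # H(X, M) = E_M(X) + X; the feed-forward is what makes the
        # compression function non-invertible even though E is a
        # permutation for each fixed message block.  Here "+" is
        # modeled as XOR over 160-bit integers.
        return encrypt(message_block, chaining_value) ^ chaining_value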

Suppose that X takes on each possible value with equal probability.  For a
single fixed M, H(X,M) loses a constant fraction of its possible values to
collisions.  Since E_M is a permutation, X -> E_M(X) + X behaves like a
random function, so only about (1 - 1/e), or roughly 63%, of the 2^{160}
possible output values can come about for a fixed M, no matter what values
X takes on.
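
You can check that coverage fraction empirically at toy scale: a random
function on n points hits about a (1 - 1/e) fraction of its range.  A
quick Python sketch (the 16-bit size is an arbitrary choice, just for
speed):

    import random

    n = 1 << 16
    # Model H(., M) for one fixed M as a random function on n points
    # and measure how much of the range it actually covers.
    covered = {random.randrange(n) for _ in range(n)}
    print(len(covered) / n)   # prints roughly 0.632, i.e. 1 - 1/e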

Now, imagine that we have a distribution on our input strings that gives us
full entropy plus a little, but the last 512 bits of each string are known
to the attacker somehow.  Then each output we generate has slightly less
than 160 bits of entropy.  A computationally unbounded attacker can try all
2^{160} possible input values for a given last 512 bits of input to the
compression function, and see which output values are impossible.  Each
time he observes an output, he does this.  If he's observing outputs from a
real random source, each output has about a 1/e (roughly 0.37) probability
of taking on some value our hash function couldn't have generated in this
situation, at which point he becomes certain he's NOT observing hash
outputs.  After a few dozen outputs, he basically has no doubts about
whether he's watching our hash outputs or outputs from a true random
source.
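
Here's the same distinguisher at toy scale, purely illustrative: a 16-bit
"compression function" modeled as a random function whose table the
attacker knows.  Truly random outputs betray themselves as soon as one
lands outside the reachable set:

    import random

    n = 1 << 16
    # The attacker tabulates every output reachable under the fixed,
    # known final message block.
    reachable = {random.randrange(n) for _ in range(n)}
    # Outputs from a true random source: about 1/e of them land on
    # values the "hash" could never produce.
    samples = [random.randrange(n) for _ in range(64)]
    misses = sum(1 for s in samples if s not in reachable)
    print(misses, "of 64 random outputs were impossible for the hash")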

So, at least that first criterion is unattainable with a standard hash
function (MD5, SHA1, SHA-256, SHA-512).  A similar argument holds for
RIPEMD and RIPEMD-160.

A natural thing to try here is HMAC.  HMAC looks like this:

HMAC_K(X) = hash( (K xor opad) || hash( (K xor ipad) || X ) )

where opad and ipad are the fixed outer and inner padding constants.
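
(For reference, this is exactly what the Python standard library computes;
the key and input below are arbitrary placeholders:)

    import hashlib
    import hmac

    # Arbitrary placeholder key and input, just to show the call.
    tag = hmac.new(b"some key K", b"the input string X", hashlib.sha1).digest()
    print(tag.hex())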

Suppose we have way more entropy than we need, maybe 2^{200} possibilities
for X.  The result of hash( (K xor ipad) || X ) is (at best) a
uniformly-distributed 160-bit random number--just what we're hoping for.
But now, we apply that second, outer hash.  For a known K, only about a
(1 - 1/e) fraction of the 2^{160} outputs will be possible, because of the
non-surjectivity of the SHA1 compression function with respect to message
input blocks.  Suppose the attacker doesn't start out knowing K, but is
computationally unbounded.  In that case, he will rule out about a 1/e
fraction of the possible keys on each output, and so after he's seen a few
hundred outputs (the outer instance of K contributes at most 160 bits of
effective key, via the 160-bit chaining value it produces), he will have
narrowed the range down to about one possible key.  He expects to need a
few more outputs to be certain he knows whether he's seeing outputs from
our HMAC function or real random outputs.  (Note that once again, trying
to use another primitive with a different kind of security proof doesn't
work.)
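
A toy-scale sketch of that key-elimination process, with 8-bit "chaining
values" and 256 candidate keys, each modeled as an independent random
function; the candidate set collapses after a dozen or so observed
outputs:

    import random

    n = 256
    # One random-function range per candidate key; the real outputs
    # are drawn from ranges[true_key].
    ranges = {k: {random.randrange(n) for _ in range(n)} for k in range(n)}
    true_key = 0
    alive = set(ranges)
    for count, out in enumerate(ranges[true_key], start=1):
        alive = {k for k in alive if out in ranges[k]}
        if len(alive) == 1:
            print("narrowed to one candidate after", count, "outputs")
            break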

The problem is related to using a non-invertible function here, but it's
possible to come up with non-invertible functions that don't have this
problem.  For example, imagine

X = hash(P0 || input string)
Y = hash(P1 || input string)
output = hash(X||Y)

where P0 and P1 are fixed padding strings of 512 bits (for SHA1; for
SHA-512 they'd be 1024 bits wide).    
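
In Python, using SHA1 from the standard library (the particular P0 and P1
byte values are arbitrary, just fixed and distinct):

    import hashlib

    P0 = b"\x00" * 64   # one 512-bit padding block, all zero bytes
    P1 = b"\x55" * 64   # another, fixed and distinct from P0

    def extract(input_string: bytes) -> bytes:
        x = hashlib.sha1(P0 + input_string).digest()
        y = hashlib.sha1(P1 + input_string).digest()
        return hashlib.sha1(x + y).digest()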

The SHA1 compression function generating the output now gets a 320-bit
string, pads it out to one full block in a fixed way, and uses that block
to encrypt the SHA1 initial chaining value; it then feeds that initial
value forward.  Now, I don't know whether the SHA1 compression function
will give a uniform distribution on the output here, assuming X and Y each
range over about 2^{159} possible values.  But there's at least nothing
inherent in the basic structure of SHA1 that prevents a uniform output
distribution here.  The "key" for the last compression function can range
over about 2^{318} values, and there's no obvious reason to expect
nonuniform results from SHA1 in that case, even given a fixed input
chaining variable.    

Basically, the issue involves allowing all the entropy in the input string
to pass through a non-invertible "narrow pipe."  

AES-CBC-MAC doesn't have this particular problem: the final AES operation
is a permutation of the chaining value, so there is no narrow
non-surjective pipe at the end.  In fact, if the input has a reasonable
margin of entropy beyond full, I believe it's going to be all but
impossible for an attacker to distinguish the outputs from random, even if
he's computationally unbounded.  But in the computationally-bounded case,
where we get less entropy than we needed for unconditional security,
AES-CBC-MAC doesn't work too well for us--that's where David's clever
meet-in-the-middle attack comes in.
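
For concreteness, a minimal CBC-MAC sketch, assuming the PyCryptodome
package for raw AES; note that raw CBC-MAC is only sound for fixed-length
inputs, which is the setting here:

    from Crypto.Cipher import AES   # PyCryptodome

    def cbc_mac(key: bytes, msg: bytes) -> bytes:
        # Raw CBC-MAC with a zero IV; the final encryption is a
        # permutation of the chaining value, so every output value
        # is reachable.
        assert len(msg) % 16 == 0
        aes = AES.new(key, AES.MODE_ECB)
        state = bytes(16)
        for i in range(0, len(msg), 16):
            block = bytes(a ^ b for a, b in zip(state, msg[i:i + 16]))
            state = aes.encrypt(block)
        return state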

Another possibility is to consider key K to be unknown to the attacker.
(Maybe this key is generated using your first entropy outputs from a fixed
key.)  The attacker then basically has to break CBC-MAC in order to
distinguish these values from random values, right?  This seems like
something where you can directly use an existing security proof, without
bending that proof's assumptions all out of shape.
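
Reusing the cbc_mac sketch above, that bootstrapping idea might look like
this (entirely hypothetical naming; the pool_block placeholders stand for
successive fixed-length batches of raw pool input):

    # Derive K from the first batch under a fixed, public key, then
    # use K for all later extractions.
    pool_block_0 = b"\x00" * 64   # placeholder: first batch of pool input
    pool_block_1 = b"\x00" * 64   # placeholder: next batch of pool input
    FIXED_KEY = bytes(16)
    K = cbc_mac(FIXED_KEY, pool_block_0)
    output = cbc_mac(K, pool_block_1)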

How does SHA1 do on computational security?  Again, let's imagine the case
where the attacker knows the last 512 bits of the input string
corresponding to each output string.  He gets a bunch of SHA1 outputs, and
he's apparently stuck.  Since he can't enumerate the possible inputs, he
can't easily test the outputs to see if they're consistent with having come
out of SHA1.  

If he's limited to, say, 2^{120} work, I don't think he can get a
non-negligible advantage distinguishing these outputs from random outputs.

Again, I can't see a specific property of SHA1 that guarantees that such
computational security will always exist for these outputs.  
But I also can't see an attack that ruins that computational security.  

Comments?

--John Kelsey



---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at wasabisystems.com


