statistical inferences and PRNG characterization

Bill Tompkins crypto at icarion.com
Fri May 19 17:37:11 EDT 2006


Travis H. wrote:
> Hi,
> 
> I've been wondering about the proper application of statistics with
> regard to comparing PRNGs and encrypted text to truly random sources.
> 
> As I understand it, when looking at output, one can take a
> hypothetical source model (e.g. "P(0) = 0.3, P(1) = 0.7, all bits
> independent") and come up with a probability that the source may have
> generated that output.  One cannot, however, say what probability such
> a source had generated the output, because there is an infinite number
> of sources (e.g. "P(0) = 0.29999..., P(1) = 0.70001...").  Can one say
> that, if the source must be A or B, what probability it actually was A
> (and if so, how)?

Short answer: you can say what evidence the observed output provides in
favor of A vs B, and if you're willing to start with a prior belief
(that A and B are equally likely before the observation, say), then you
can calculate the probability after the observation:

P(A | D) / P(B | D) = P(D | A) / P(D | B)
 (given that A and B were equally likely before seeing the data D)
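
To make that concrete, here's a minimal sketch in Python (the bit string
and the two source models below are invented purely for illustration): it
computes P(D | A) and P(D | B) for two hypothetical independent-bit
sources and converts the ratio into a posterior probability of A under a
50/50 prior.

from math import prod

# Two hypothetical independent-bit source models (illustrative numbers):
#   A: P(1) = 0.7      B: P(1) = 0.5 (an unbiased source)
p1_A, p1_B = 0.7, 0.5

# Some observed output D (also made up for the example).
D = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

def likelihood(bits, p1):
    """P(D | model) for an i.i.d. source with P(1) = p1."""
    return prod(p1 if b == 1 else (1.0 - p1) for b in bits)

L_A = likelihood(D, p1_A)
L_B = likelihood(D, p1_B)

print("P(D|A)/P(D|B) =", L_A / L_B)          # ~2.28: weak evidence for A
print("P(A | D)      =", L_A / (L_A + L_B))  # ~0.69, given equal priors

For any realistically long output you'd sum log-probabilities instead of
multiplying, to avoid floating-point underflow, but the idea is the same.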


Long answer:

There are a few philosophies on the correct way to think about such
issues.  The Bayesian inference approach would simply be to write the
probability that a source model is correct (given some data) in terms
of the probability that one would observe the data (given that source
model), the probability of the source model being correct independent
of the data (aka the prior), and the overall probability of the data,
marginalized over all candidate models:

P(M1 | D) = P(D | M1) * P(M1) / P(D)

  M1 = model 1, D = observed data.  This equation is Bayes' Theorem,
  and is trivially derived from the fact that
  P(M1 & D) = P(M1 | D) * P(D) = P(D | M1) * P(M1)


Depending on the type of data, determining P(D) can be an exercise in
philosophical futility, but note that in the ratio of two "posterior"
probabilities (the terms on the LHS), the P(D) terms will cancel:

P(M1 | D) / P(M2 | D)  =  P(D | M1) / P(D | M2)  * P(M1) / P(M2)


Now, if (say) you are using this as a spam filter, you might determine
that the prior probability (P(M1)) for one model is different from the
prior probability of a different model... for me, the probability that a
given email is spam seems to be about 99%, so P(M1)/P(M2) might be about
0.01 (taking M1 to be the non-spam model).
 For a cryptographic application, well, you'd want to figure out
something sensible given the two models.  People who dislike Bayesian
inference tend to have an issue with the "arbitrary" choice of prior
probabilities.
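
As a sketch of how such a prior feeds in (the numbers here are invented,
not from a real spam model): the posterior odds are just the likelihood
ratio multiplied by the prior odds, and with only two candidate models
you can convert odds to a probability directly.

# Hypothetical numbers, for illustration only.
prior_odds = 0.01         # P(M1)/P(M2) before seeing the data
likelihood_ratio = 500.0  # P(D|M1)/P(D|M2), the evidence from the data

posterior_odds = likelihood_ratio * prior_odds    # P(M1|D)/P(M2|D) = 5.0
p_M1_given_D = posterior_odds / (1.0 + posterior_odds)

print("posterior odds =", posterior_odds)  # 5.0
print("P(M1 | D)      =", p_M1_given_D)    # ~0.83

So a likelihood ratio of a few hundred is enough to overcome a 100:1
prior against M1.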

Note that the ratio of the likelihoods (P(D | M1) is the likelihood of
M1 given the data D) is a measure of the evidence in favor of M1 vs M2,
given the data D.  Non-Bayesian approaches tend to use twice the log of
this ratio as a test statistic; when one model is nested in the other,
it is asymptotically chi^2 distributed under the simpler model (Wilks'
theorem).

Oh, and do beware the number of degrees of freedom in your two models -
if they are not equal, you need to do a bit more work than is outlined here.
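
A sketch of that frequentist recipe (using SciPy; assuming the simpler
model is nested in the richer one, and with hypothetical log-likelihoods
just to show the mechanics):

from scipy.stats import chi2

# Hypothetical log-likelihoods of the same data under two nested models.
logL_simple = -1466.2   # e.g. unbiased-bit model, 0 fitted parameters
logL_rich   = -1461.5   # e.g. fitted-bias model, 1 fitted parameter

# Wilks' theorem: 2 * (logL_rich - logL_simple) is asymptotically chi^2
# under the simpler model, with df = difference in fitted parameters.
stat = 2.0 * (logL_rich - logL_simple)
df = 1

p_value = chi2.sf(stat, df)    # P(chi2_df >= stat)
print("statistic =", stat)     # 9.4
print("p-value   =", p_value)  # ~0.0022, so the simpler model looks poor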

-Bill
