[Cryptography] Duh, why aren't most embedded TRNGs designed this way?

Bill Cox waywardgeek at gmail.com
Thu Apr 22 14:23:28 EDT 2021


In short, use just 2 ring oscillators in an FPGA or ASIC, clocking 2
counters, one binary and one gray code, and use some clever software to
(hopefully) securely estimate the entropy collected, avoiding the most
common reasons for TRNG failures in embedded systems.

Either this design is in common use, but somehow I've missed it, in which
case I'm a dork, or this design doesn't work, in which case, I'm a dork.
Either way, I apologize for being such a dork.

TL;DR

<rant>
The TRNG designs I review are almost always total crap, in the sense that
there is no physical model we can use to estimate how much entropy they
generate.  To compensate, folks interested in more secure TRNGs for
cryptography simply put multiple total-crap TRNGs in parallel and XOR
their outputs together <https://www.hindawi.com/journals/ijrc/2009/501672/>.
I consider this to be n * totalCrap.  It drives me nuts to see this, because
they all fail in the end for the same reason, and parallel TRNGs that are
each crap are just as vulnerable.  In the end some programmer gets asked to
improve the rate of random data generated by the TRNG so that an embedded
controller can boot faster.  This programmer, who doesn't know a thing
about analog electronics, simply looks at a hexadecimal string of generated
bytes: if it "looks" random enough, they reduce the time between samples,
and once it no longer "looks" random, they increase the time between
samples by maybe 2x.  Having multiple crap TRNGs in parallel does not stop
the programmer from choosing a sample delay short enough to ensure the
output is not very random.  If they follow this process and pass the NIST
entropy tests (also total crap), they're done!  If they have trouble
passing Diehard (and if they care), they run the TRNG output through a
whitener, ensuring that even Dieharder can't figure out why the output is
still total crap.  We only see that it is crap when we start finding RSA
moduli with common factors <http://www.loyalty.org/~schoen/rsa/> in the
wild.
</rant>

So, this morning, I am certainly a dork, but I'm just not sure why.  Am I a
dork for not knowing about this design, or a dork because it is flawed?

The design:

[image: block diagram of the two-ring-oscillator TRNG described below]

Use two digital ring oscillators, which can be placed and routed in an FPGA
without any special placement or routing constraints.  With the first
oscillator, clock a binary counter with enough bits that one of its output
bits toggles with a period long enough to accumulate significant, measurable
jitter relative to the other oscillator.  With the other oscillator, clock a
gray code counter, and record its output on the rising edge of the
appropriate bit from the binary counter.

The deltas between gray code counter values approximate a normal distribution
<https://en.wikipedia.org/wiki/Jitter#:~:text=Random%20jitter%20typically%20follows%20a,distributions%2C%20approaches%20a%20normal%20distribution.>
if thermal noise is the only cause of the jitter.  This distribution should
be monitored continuously by the host CPU as a health check.  Ideally, we
only accept the output of the TRNG if:


   - The output distribution closely approximates a normal distribution
   (a sketch of one such running check appears after this list)
      - The distribution estimate should be updated continuously, as it
      can drift with voltage, age, etc.
   - A filter continuously trained to predict the next output from several
   prior samples should be used to reduce the entropy estimate per sample
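
A minimal sketch of such a running normality check, assuming the deltas from
the sketch above.  The smoothing constant, window size, and tolerance band
are illustrative assumptions, not tuned values:

    #include <math.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define EWMA_ALPHA 0.001  /* smoothing factor: an assumption, not tuned */

    static double mean = 0.0, var = 1.0;
    static uint32_t in_one_sigma = 0, total = 0;

    /* Feed one delta; returns false if the distribution stops looking
     * normal.  For a normal distribution, roughly 68% of samples fall
     * within one standard deviation of the mean. */
    bool health_check(double delta) {
        double d = delta - mean;
        mean += EWMA_ALPHA * d;               /* running mean */
        var  += EWMA_ALPHA * (d * d - var);   /* running variance */

        if (fabs(delta - mean) < sqrt(var))
            in_one_sigma++;
        total++;

        if (total == 100000) {                /* evaluate once per window */
            double frac = (double)in_one_sigma / total;
            in_one_sigma = total = 0;
            if (frac < 0.62 || frac > 0.74)   /* loose band around 0.6827 */
                return false;
        }
        return true;
    }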

This scheme should work well, assuming:


   - One sample has negligible influence on samples several steps in the
   future (farther apart than the number of prior samples the trained
   filter uses)

The entropy estimate per sample is simply log2(1/likelihood) of a measured
sample.  This is true independent of the TRNG design.  The trick is being
able to accurately predict the likelihood of each sample.  There will be
many sources of noise causing jitter, and thermal noise is likely to be only
a small component.  These other sources come in various forms, but entropy
from such noise can be accounted for with a filter trained to predict the
likelihood of the next sample given several prior samples, assuming that
each output sample is independent of samples further away in time than the
number of samples used by the filter.
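
For concreteness, here is a toy sketch of that per-sample estimate.  In
place of a real filter trained on several prior samples, it uses a
Laplace-smoothed conditional histogram keyed on just the previous sample's
bin; the bin count, quantization, and smoothing are all illustrative
assumptions:

    #include <math.h>
    #include <stdint.h>

    #define BINS 64                   /* assumption: quantize deltas into 64 bins */

    static uint32_t hist[BINS][BINS]; /* hist[prev][cur] occurrence counts */
    static uint32_t row_total[BINS];
    static unsigned prev_bin;

    /* Map a raw delta into a histogram bin (centering and scale are
     * assumptions). */
    static unsigned bin_of(double delta, double mean, double sigma) {
        int b = (int)((delta - mean) / sigma * 8.0) + BINS / 2;
        if (b < 0) b = 0;
        if (b >= BINS) b = BINS - 1;
        return (unsigned)b;
    }

    /* Returns the entropy estimate, in bits, for this sample:
     * log2(1/likelihood), with the likelihood estimated from how often this
     * bin followed the previous bin, smoothed so it is never zero. */
    double entropy_of_sample(double delta, double mean, double sigma) {
        unsigned cur = bin_of(delta, mean, sigma);
        double p = (hist[prev_bin][cur] + 1.0) / (row_total[prev_bin] + BINS);
        hist[prev_bin][cur]++;
        row_total[prev_bin]++;
        prev_bin = cur;
        return -log2(p);              /* log2(1/p) */
    }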

Unfortunately, the assumption that samples at least n apart are independent
can easily be invalid.  For example, a switching power supply for the chip
might deliver a saw-tooth shaped voltage at a frequency of, e.g., 200 kHz.
Or, a physically present attacker can inject a custom-designed supply
voltage waveform intended to PWN the TRNG entropy estimate.  The filter may
work OK in this case: if several samples in a row are all centered around a
different part of the distribution, the filter will predict that the next
value is also in that area.  A stronger attack might be for the attacker to
introduce supply voltage variation roughly in sync with the sample period,
drawn from a pseudo-random normal distribution, causing the filter to
estimate that each sample has more entropy than it really does, since the
attacker knows the actual voltage pattern.

Systems that need to defend against a physically present attacker should be
conservative, assume that most of the estimated entropy comes from the
attacker, and collect more before generating keys to compensate.  One
mitigation might be to remember the lowest per-sample entropy the TRNG has
ever produced, and use that to determine the minimum number of samples to
collect.
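
A minimal sketch of that mitigation, assuming the per-sample estimates from
the sketch above.  The safety factor and the idea of persisting the
worst-case value across boots are my assumptions:

    #include <stdint.h>

    /* Worst (lowest) per-sample entropy estimate ever observed, in bits.
     * Assumption: persisted in non-volatile storage across boots. */
    static double worst_bits_per_sample = 1e9;

    void record_sample_entropy(double bits) {
        if (bits < worst_bits_per_sample)
            worst_bits_per_sample = bits;
    }

    /* Number of samples to collect before seeding a key of key_bits
     * strength, with an assumed safety factor to discount
     * attacker-influenced entropy. */
    uint32_t samples_needed(uint32_t key_bits) {
        double margin = 4.0;          /* assumption, not a recommendation */
        return (uint32_t)(key_bits * margin / worst_bits_per_sample) + 1;
    }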

Defending against physical attacks is super hard, and there are various
mitigations, such as temperature swing detection, voltage swing detection,
detection of voltage spikes on I/Os above or below the supplies, etc.  There
are more complex and better designs than this one for systems that need to
defend against physically present attackers.

However, for typical embedded devices, where the goal is just to instantiate
a TRNG from a Verilog library and install a provided TRNG driver for the
embedded core, it appears possible to do a decent job with the above design.

So, why am I a dork today?  Fatal flaw, or just another rediscovery of an
already popular circuit?

Bill