[Cryptography] Hierarchical bulk digest modes

Phillip Hallam-Baker phill at hallambaker.com
Wed Aug 18 10:48:44 EDT 2021


Traditionally, a digest converts a sequence of bits to a fixed-length output:

H(d) -> 256 bits
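
For concreteness, a rough sketch in Python (SHA-256 from the standard
library; the 1 MiB read size is just a placeholder):

    import hashlib

    def whole_file_digest(path):
        # One digest over the whole file: the output is always 32 bytes
        # (256 bits), however large the input is.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.digest()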

We can now detect a single bit of corruption in the data file. Great! But
if the data is corrupted, the digest gives us no information about where
the corruption occurred. That is fine for small files, but what about when
we are dealing with 1 TB archives?

A better approach would be to split the data into chunks and construct a
digest (a Merkle tree, perhaps) over the chunks. But how big should the
chunks actually be? There is clearly a tradeoff here that we traditionally
skate over entirely because we take this 'any fault will destroy our
crypto' approach.
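
Something like the following sketch, with an arbitrary 1 MiB chunk size
and a binary tree; both choices are exactly the parameters being skated
over:

    import hashlib

    CHUNK_SIZE = 1 << 20  # 1 MiB; picking this number is the tradeoff in question

    def chunk_digests(data):
        # Hash each chunk separately so a mismatch identifies the damaged chunk.
        return [hashlib.sha256(data[i:i + CHUNK_SIZE]).digest()
                for i in range(0, len(data), CHUNK_SIZE)]

    def merkle_root(digests):
        # Pair up digests and hash the pairs until a single root remains.
        level = list(digests)
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])  # duplicate the odd leaf out
            level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0]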

Another problem with the traditional approach is that it lacks parallelism.
I have 64 cores on my CPU and I would like to be able to use all of them
for hashing. So that points to dividing up the file at the chunk level: if
I have 16 chunks and want to use 4 threads, instead of calculating

0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

I would instead do

0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

In either case, the result is D0 = H(H(0) + H(1) + H(2) + H(3) + I)

where I is an index descriptor giving the index of the chunk in the
document, the hash mode, etc.
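
A sketch of that per-group digest, with the chunk hashes farmed out to a
worker pool; the descriptor encoding (group index concatenated with a
hash-mode label) is invented purely for illustration:

    import hashlib
    from concurrent.futures import ThreadPoolExecutor

    def sha256_bytes(b):
        return hashlib.sha256(b).digest()

    def group_digest(chunks, group_index, hash_mode=b"SHA-256", workers=4):
        # D_i = H(H(chunk 0) + H(chunk 1) + ... + I)
        # The chunk hashes are computed in parallel; I is the index descriptor.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            chunk_hashes = list(pool.map(sha256_bytes, chunks))
        descriptor = group_index.to_bytes(8, "big") + hash_mode
        return hashlib.sha256(b"".join(chunk_hashes) + descriptor).digest()

(CPython's hashlib releases the GIL while hashing large buffers, so threads
do give real parallelism here.)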

I would then build a 'fat hash', which would consist of D0 + D1 + D2 + ... + Dn
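
Continuing the sketch above, the fat hash is then just the concatenation of
the per-group digests (a fixed group of four chunks per digest is again an
arbitrary choice):

    def fat_hash(chunks, group_size=4):
        # D0 + D1 + D2 + ... + Dn, using group_digest() from the sketch above.
        digests = [group_digest(chunks[i:i + group_size], i // group_size)
                   for i in range(0, len(chunks), group_size)]
        return b"".join(digests)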

OK, so at this point we can detect corruption but not recover from it. To
do that, we need redundancy in the data. So a RAID-like construct would be
required within the data package:

0 1 2 3 P

P = 0 xor 1 xor 2 xor 3
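
A minimal illustration of that single-parity scheme (equal-length chunks
assumed, as in RAID 4/5; losing any one chunk is recoverable, losing two is
not):

    def xor_parity(chunks):
        # P = chunk 0 xor chunk 1 xor ... ; all chunks must be the same length.
        parity = bytearray(len(chunks[0]))
        for chunk in chunks:
            for i, b in enumerate(chunk):
                parity[i] ^= b
        return bytes(parity)

    def recover_missing(surviving_chunks, parity):
        # XOR of the parity with the surviving chunks reproduces the lost chunk.
        return xor_parity(list(surviving_chunks) + [parity])

    # e.g. chunk 2 lost:
    #   data = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
    #   P = xor_parity(data)
    #   recover_missing([data[0], data[1], data[3]], P) == data[2]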