[Cryptography] Hierarchical bulk digest modes

Jon Callas jon at callas.org
Thu Aug 19 21:01:51 EDT 2021



> On Aug 18, 2021, at 07:48, Phillip Hallam-Baker <phill at hallambaker.com> wrote:
> 
> We can now detect a single bit of corruption in the data file. Great! But that doesn't give us any information if the data is corrupted. We don't know where the corruption occurred. That is fine for little files but what about when we are dealing with 1TB archives?
> 
> A better approach would be to split the data up into chunks and construct a digest (Merkle tree??) over the chunks. But how big should the chunks actually be? There is clearly a tradeoff here which we traditionally just skate over entirely because we have this 'any fault will destroy our crypto' approach.

Likely what you want is what's called a rolling hash (or rolling checksum), such as a Rabin fingerprint or Rabin-Karp hash, or, as Jerry mentioned, the rolling Adler-32 checksum that rsync uses. Fletcher checksums are likely usable here, too; Adler-32 is based on the Fletcher checksum. (Years ago, PGP Disk used Fletcher checksums to calculate the IVs of logical blocks, because in those days tweakable modes like XTS didn't exist.)
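To show what "rolling" buys you, here's a minimal sketch of an rsync-style rolling checksum. The modulus and window size are illustrative choices, not rsync's actual parameters; the point is that sliding the window one byte costs O(1) instead of rehashing the whole window:

```python
M = 1 << 16  # illustrative modulus; a power of two keeps the update cheap

def weak_checksum(block: bytes):
    """Compute the two-part rolling checksum of a window from scratch."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one position: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % M      # update the plain sum first
    b = (b - n * out_byte + a) % M        # then the position-weighted sum
    return a, b

# Sanity check: rolling across the data matches recomputing each window.
data = b"the quick brown fox jumps over the lazy dog"
n = 8
a, b = weak_checksum(data[:n])
for i in range(len(data) - n):
    a, b = roll(a, b, data[i], data[i + n], n)
    assert (a, b) == weak_checksum(data[i + 1 : i + 1 + n])
```

The O(1) update is what makes it practical to test a checksum at every byte offset of a large file, which a cryptographic hash can't do cheaply.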

You could use a Merkle tree or something like it, though it's probably more work on several dimensions (time, space, network, and so on).
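For concreteness, here's a minimal sketch of a Merkle tree over fixed-size chunks, showing how it localizes corruption rather than just detecting it. The chunk contents and SHA-256 are arbitrary choices for the sketch:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(chunks):
    """Build every level of a Merkle tree; levels[0] holds the leaf hashes
    and levels[-1][0] is the root."""
    level = [h(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        # pair adjacent nodes; a leftover odd node is promoted unchanged
        level = [h(level[i] + level[i + 1]) if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

chunks = [b"chunk-%d" % i for i in range(8)]   # stand-ins for large chunks
corrupt = list(chunks)
corrupt[5] = b"chunk-5 with a flipped bit"

good, bad = merkle_levels(chunks), merkle_levels(corrupt)
assert good[-1][0] != bad[-1][0]   # the root alone detects the corruption
where = [i for i, (g, c) in enumerate(zip(good[0], bad[0])) if g != c]
assert where == [5]                # comparing leaf hashes localizes it
```

The extra work is the tradeoff mentioned above: you store and transmit a hash per chunk (plus the interior nodes) instead of one digest for the whole file.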

Moreover, why reinvent the wheel? Save the time, space, and brainpower you'd use to make cryptographic hashes work properly. I'd go look at some of the many explanations of how rsync does it, even if you want to retool with a different rolling hash. If I were in your shoes, I'd just do what rsync does.
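For flavor, a toy sketch of the two-level idea rsync uses: a cheap weak hash filters candidate blocks, and a strong hash confirms real matches. The block size and hash choices here are illustrative, and for simplicity this recomputes the weak hash at each offset instead of rolling it as rsync does:

```python
import hashlib

BLOCK = 4  # tiny block size for the demo; real tools use hundreds of bytes

def weak(block: bytes) -> int:
    # toy weak hash standing in for a rolling checksum (a plain sum rolls)
    return sum(block) & 0xFFFF

def strong(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

# "Receiver" side: checksum every block of the old file.
old = b"abcdefgh12345678"
sigs = {}
for off in range(0, len(old), BLOCK):
    blk = old[off:off + BLOCK]
    sigs.setdefault(weak(blk), []).append((strong(blk), off))

# "Sender" side: slide over the new file one byte at a time; the weak
# hash is a cheap filter, the strong hash confirms a true match.
new = b"XXabcdefgh12345678"
matches = []
for i in range(len(new) - BLOCK + 1):
    window = new[i:i + BLOCK]
    for s, off in sigs.get(weak(window), []):
        if strong(window) == s:
            matches.append((i, off))
            break
```

The weak hash keeps the per-byte cost trivial; the strong hash is only computed on the rare offsets where the weak hash collides with a known block.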

	Jon


