[Cryptography] let's do something intelligent about md5sum!

Michael Kjörling michael at kjorling.se
Sat Jun 13 18:21:19 EDT 2015


On 13 Jun 2015 14:18 -0700, from jsd at av8n.com (John Denker):
> The 500th time I had this conversation, I got really
> tired of it.  I learned my lesson.  The cost of a program
> has to include the cost of the whole life-cycle, including
> support, evolution, and outroduction.  There's always a 
> lot of pressure to get "something" out the door, but 
> sometimes that leads to disastrously false economies.

For the people not in the know, this is often termed "technical debt".
https://en.wikipedia.org/wiki/Technical_debt


> As a tiny but specific example of the sort of thing I'm
> talking about, consider a manifest containing filenames
> and checksums.  I say the manifest ought to contain some
> sort of version marker that indicates which checksum
> algorithm was used (md5sum or whatever).  The counter-
> argument is that the file is shorter and easier to parse
> if it does not contain the version marker ... but still
> I insist that omitting the marker is a false economy,
> because it makes it harder to blaze a migration path
> later in the life cycle.

This is something that the people writing the code to move away from
simply using crypt() for password storage on *nix (specifically, as
far as I am aware, Linux) had to contend with. It was easier there,
though, in part because the stored password field was not a
fixed-length field the way a 128-bit MD5 hash is fixed-length.
The solution they came up with was fairly elegant, particularly given
the constraints (pure text file, likely not wanting to alter the data
meta-format itself by introducing a new field, ...): _prepend an
identifier to the value, which is invalid in the old scheme, stating
which algorithm was used to generate it._

crypt() already had 12 bits of salt, expressed as two base64
characters, prepended to the encrypted value (itself expressed as
base64), so this was easy to do by starting the field with a character
that could never be valid in that base64 alphabet (the format became
`$` followed by a numeral followed by `$` followed by the encrypted or
hashed value as further defined by the specific scheme; if the first
character of the stored value is not `$` then you know it's old-style
crypt()). These days we have a handful of
algorithms, each with its own identifier; tools that validate
passwords need to be aware of the old schemes; but tools that _set_
passwords only need to know about any _one_ of these algorithms
(ideally, the most secure one, for some value of secure). Thus, over
time, as people change their passwords, the data gets migrated
_transparently_ to, hopefully, stronger algorithms. It also allows for
the possibility of administrators being able to disable the old
schemes once those are no longer needed.
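
To make the mechanism concrete, here is a minimal sketch in Python of
how a tool might tell the schemes apart and decide when a stored value
should be migrated. The function names and the choice of `6` as the
preferred identifier are assumptions made for illustration (`$1$` and
`$6$` are the identifiers glibc uses for its MD5-based and
SHA-512-based schemes); this is not the code any particular system
uses.

```python
# Hypothetical policy: the identifier that newly set passwords should use.
PREFERRED_SCHEME = "6"


def scheme_of(stored: str) -> str:
    """Return the scheme identifier of a stored password field."""
    if not stored.startswith("$"):
        # No "$" prefix: this can only be an old-style crypt() value.
        return "legacy-crypt"
    # "$6$salt$hash" splits into ["", "6", "salt", "hash"].
    parts = stored.split("$")
    if len(parts) < 3 or not parts[1]:
        raise ValueError("malformed password field")
    return parts[1]


def needs_migration(stored: str) -> bool:
    """True if the value should be re-hashed the next time it is set."""
    return scheme_of(stored) != PREFERRED_SCHEME
```

A validator dispatches on `scheme_of()` and so has to know every
scheme; a tool that sets passwords only ever writes
`PREFERRED_SCHEME`, which is what makes the migration transparent.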

If we are going to break things anyway, a similar approach could, in
principle, be used for hash values. Reserve, say, two bytes prepended
to the hash (65,536 options should be enough for anybody, right?)
that specify the hash in use.
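
As a rough sketch of what that could look like in Python: the
identifier values below are invented for this example (no such
registry exists that I know of), and the big-endian byte order is
likewise just an assumption.

```python
import hashlib

# Hypothetical identifier assignments; a real scheme would need a
# coordinated registry.
ALGORITHMS = {
    0x0001: "md5",
    0x0002: "sha1",
    0x0003: "sha256",
    0x0004: "sha512",
}


def tagged_digest(alg_id: int, data: bytes) -> bytes:
    """Return a two-byte algorithm identifier followed by the raw digest."""
    name = ALGORITHMS[alg_id]
    return alg_id.to_bytes(2, "big") + hashlib.new(name, data).digest()


def split_tagged(tagged: bytes) -> tuple[int, bytes]:
    """Split a tagged value into (identifier, raw digest)."""
    if len(tagged) < 2:
        raise ValueError("tagged hash is too short to carry an identifier")
    return int.from_bytes(tagged[:2], "big"), tagged[2:]
```

With these assignments a SHA-256 value is 34 bytes long: two bytes of
identifier plus the 32-byte digest.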

The only major downside is that you need some sort of coordination in
assigning the identifiers, and in the case of storing hashes in a
database or similar that the maximum length of the field is unknowable
ahead of time. (The maximum output length given a _particular set_ of
algorithms can certainly be knowable, however, and moving from a
128-bit output hash length to a 256-bit output hash length is already
a schema-breaking change with fixed-length storage fields, so no major
change there.) An additional downside is that software that needs to
verify the hashes needs to support several different algorithms, but
with a reasonably sized set, that is manageable. (Think something like
perhaps MD5 and SHA1 for backwards compatibility, and SHA256, SHA512,
BLAKE2 and SHA3 moving forward. While not everyone would get their pet
hash algorithm, it seems like a reasonable starting set _could be
agreed on_.)

With something like this, software can also reasonably intelligently
handle the case where the indicated hash algorithm or scheme is not
supported.
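
For illustration, a verifier could report three distinct outcomes
instead of two. This is only a sketch: the identifier values, the
two-byte big-endian prefix, and the names `Result` and `check` are
assumptions for the example, not an existing format or API.

```python
import hashlib
import hmac
from enum import Enum


class Result(Enum):
    MATCH = "match"
    MISMATCH = "mismatch"
    UNSUPPORTED = "unsupported"


# Hypothetical identifiers for the algorithms this verifier implements.
SUPPORTED = {0x0003: "sha256", 0x0004: "sha512"}


def check(tagged: bytes, data: bytes) -> Result:
    """Verify data against a hash carrying a two-byte algorithm prefix."""
    if len(tagged) < 2:
        raise ValueError("tagged hash is too short to carry an identifier")
    alg_id = int.from_bytes(tagged[:2], "big")
    name = SUPPORTED.get(alg_id)
    if name is None:
        # Unknown identifier: we cannot say whether the data is intact.
        return Result.UNSUPPORTED
    expected = hashlib.new(name, data).digest()
    if hmac.compare_digest(tagged[2:], expected):
        return Result.MATCH
    return Result.MISMATCH
```

The point is that "I cannot check this" becomes distinguishable from
"this is corrupt", which an untagged hash of unfamiliar length does not
give you.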

-- 
Michael Kjörling • https://michael.kjorling.se • michael at kjorling.se
OpenPGP B501AC6429EF4514 https://michael.kjorling.se/public-keys/pgp
                 “People who think they know everything really annoy
                 those of us who know we don’t.” (Bjarne Stroustrup)

