[Cryptography] SHA1 collisions make Git vulnerable to attacks by third parties, not just repo maintainers

Sun Feb 26 01:36:15 EST 2017

I tried to fix this when git was young, when it would've been easy.
Linus rejected the suggestion and didn't seem to understand the
threat.  He wired assumptions about SHA1 deeply into git.  In the next
few years, nasty people will teach him the threat model, with ungentle
manipulations of his and many other people's source trees.

	John

To: torvalds at osdl.org, gnu at toad.com
Subject: SHA1 is broken; be sure to parameterize your hash function
Date: Sat, 23 Apr 2005 15:21:07 -0700
From: John Gilmore <gnu at new.toad.com>

It's interesting watching git evolve.  I have one comment, which is
that the code and the contributors are throwing around the term "SHA1
hash" a lot.  They shouldn't.  SHA1 has been broken; it's possible to
generate two different blobs that hash to the same SHA1 hash.  (MD5
has totally failed; there's a one-machine one-day crack.  SHA1 is
still *hard* to crack.)  But as Jon Callas and Bruce Schneier said:
"Attacks always get better; they never get worse.  It's time to walk,
but not run, to the fire exits.  You don't see smoke, but the fire
alarms have gone off.  It's time for us all to migrate away from
SHA-1."  See the summary with bibliography at:

  http://www.schneier.com/crypto-gram-0503.html

Since we don't have a reliable long-term hash function today, you'll
have to change hash functions a few years out.  Some foresight now
will save much later pain in keeping big trees like the kernel secure.
Either that, or you'll want to re-examine git's security assumptions
now: what are the implications if multiple different blobs can be
intentionally generated that have the same hash?  My initial guess is
that changing hash functions will be easier than making git work in
the presence of unreliable hashing.

In the git sources, you'll need to install a better hash function when
one is invented.  For now, just make sure the code and the
repositories are modular -- they don't care what hash function is in
use.  Whether that means making a single git repository able to use
several hash functions, or merely making it possible to have one
repository that uses SHA1 and another that uses some future
WonderHash, is a system design decision for you and the git
contributors to make.  The simplest case -- copying a repository with
one hash function into a new repository using a different hash
function -- will change not only all the hashes, but also the contents
of objects that use hash values to point to other objects.  If any of
those objects are signed (e.g. by PGP keys) then those signatures will
not be valid in the new copy.

Adding support now for SHA256 as well as SHA1 would make it likely
that at least git has no wired-in dependencies on the *names* or
*lengths* of hashes, and let you explore the system level issues.  (I
wouldn't build in the assumption that each different hash function
produces a different length output, either, though these two happen
to.)
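
A minimal sketch of what "parameterized" could look like, not anything
taken from the git sources.  It assumes the object naming that git
ended up with (hash of "<type> <length>", a NUL byte, then the
contents); the names object_id and tagged_id, and the "algo:hex" tag
format, are hypothetical and exist only for illustration.

```python
import hashlib


def object_id(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Name an object by hashing "<type> <length>\\0<content>".

    The hash function is a parameter rather than a wired-in constant.
    """
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


def tagged_id(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hypothetical self-describing ID: the algorithm name travels with
    the digest, so nothing has to guess the algorithm from its length."""
    return f"{algo}:{object_id(obj_type, content, algo)}"


if __name__ == "__main__":
    data = b"hello\n"
    for algo in ("sha1", "sha256"):
        print(tagged_id("blob", data, algo))
```

Carrying the algorithm name next to the digest is one way to avoid
relying on the two functions having different output lengths; whether a
single repository should mix algorithms is the design decision left
open above.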

Enjoy,

	John Gilmore

Date: Mon, 25 Apr 2005 13:38:40 -0700 (PDT)
From: Linus Torvalds <torvalds at osdl.org>
To: Seth David Schoen <schoen at eff.org>
cc: John Gilmore <gnu at toad.com>, Kees Cook <kees at osdl.org>
Subject: Re: John Gilmore on SHA-1 [gnu at toad.com: Pls forward to Linus: SHA1
 is broken]
In-Reply-To: <20050425192520.GS14282 at zork.net>

...
As to your SHA1 concerns:

> It's interesting watching git evolve.  I have one comment, which is
> that the code and the contributors are throwing around the term "SHA1
> hash" a lot.  They shouldn't.  SHA1 has been broken; it's possible to
> generate two different blobs that hash to the same SHA1 hash.

Actually, even the theoretical breaking has not been proven for a 
pre-existing SHA1 hash (ie you need to control both the starting point for 
it), and more importantly, git really uses the SHA1 as a _hash_, not 
necessarily as a cryptographically secure one.

IOW, security doesn't actually depend on the hash being cryptographic, and
all git really wants is to avoid collisions, ie it wants it to hash the
contents well. That, sha1 definitely does, and even an md5sum would
suffice (but having 160 bits instead of "just" 128 obviously adds to the
space, so that's always a bonus).

Of course, the fact that sha1 is also very expensive to try to fool is a 
big bonus, since it means that it's just another layer on the real 
security model. But the _real_ security comes from the fact that git is 
distributed, which means that a developer should never actually use a 
public tree for his development.

For example, I've got two separate firewall layers (and a NAT) in between 
me and the internet, and my personal tree is on that machine. I never 
actually trust or use the external trees - I just push the result to them. 

This is something you cannot do with a centralized SCM server like SVN or
other traditional crud. A centralized one obviously has to be accessible
to all the developers, which means that it's forced to be open enough to
be much more easily attackable, and also means that there is a single 
point of failure from a security standpoint. 

In contrast, even if somebody were to compromise my machine, that does 
_not_ automatically compromise the trees of other developers. They'd still 
have all the pristine objects, and never even fetch an object from me that 
has the same name (ie sha1 hash) as one they already have.

In other words, to really break a git archive, you need to

 - be able to replace an existing SHA1 hash'ed object with one that hashes
   to the same thing (_not_ the breakage that has been shown to be 
   possible already)
 - the replacement has to still honor all the other git consistency checks 
   (even "blob" objects have them: they need to have a valid header with a
   valid length, so it's not sufficient to just find another object that 
   hashes to the right thing, you have to find an object with a valid 
   header that hashes to the right thing)
 - you have to break in to _all_ archives that already have that object 
   and replace it quietly enough that nobody notices.
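
The header consistency check in the second point can be sketched
roughly as follows.  This is an illustration under assumptions, not
git's actual code: it works on the already-decompressed bytes of an
object, it assumes the "<type> <length>" plus NUL header layout, and the
set of type names and the function name parse_object are chosen for
the example.

```python
# Object type names assumed for this sketch.
VALID_TYPES = {b"blob", b"tree", b"commit", b"tag"}


def parse_object(raw: bytes) -> tuple[str, bytes]:
    """Split a decompressed object into (type, body), rejecting any
    object whose header is malformed or whose declared length is wrong."""
    header, sep, body = raw.partition(b"\0")
    if not sep:
        raise ValueError("missing NUL terminator after header")
    parts = header.split(b" ")
    if len(parts) != 2:
        raise ValueError("header is not '<type> <length>'")
    obj_type, size = parts
    if obj_type not in VALID_TYPES:
        raise ValueError("unknown object type")
    if not size.isdigit() or int(size) != len(body):
        raise ValueError("declared length does not match body")
    return obj_type.decode("ascii"), body


if __name__ == "__main__":
    print(parse_object(b"blob 5\0hello"))
    try:
        parse_object(b"blob 6\0hello")
    except ValueError as err:
        print("rejected:", err)
```

A colliding replacement would have to pass a check of this kind in
addition to matching the hash, which narrows the attacker's freedom
without by itself ruling the attack out.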

Quite frankly, it's not worth worrying about. It's a hell of a lot easier
to just break a source archive with other means (ie pay a developer ten
million dollars to just insert the back door you want inserted).

		Linus

To: David Wagner <daw at cs.berkeley.edu>
Subject: Re: Linus Torvalds: Re: SHA1 is broken 
Date: Fri, 29 Apr 2005 01:20:21 -0700
From: John Gilmore <gnu at toad.com>

> SHA1 isn't totally broken yet.  The attack still requires at least
> 2^60 work to find a collision.  

Knew that -- but "Attacks never get harder, only easier."

> No one has publicly reported finding a collision in SHA1 yet.

I thought the Chinese team had reported four pairs of colliding 
plaintexts -- they just hadn't revealed exactly how they generated
them.  Or are you distinguishing "finding" from "generating" a collision?

> One question I would have is what is the impact of a SHA1 collision on
> his system?  In other words, what harm can you do if you can find SHA1
> collisions efficiently?  I'm not familiar with his source mgmt system,
> but if there is little harm one can do with a collision, then maybe it
> just doesn't matter very much.

Here's the mailing list for git:

  http://kerneltrap.org/mailarchive/15/overview/browse/month

Somewhere in there it told me where to find the sources, which include
a design document about how it works.  Ah, there it is:

  http://www.kernel.org/pub/software/scm/cogito/
  http://www.kernel.org/pub/software/scm/cogito/README

Basically, it assumes, deeply embedded, that if two blobs have the
same hash, they ARE THE SAME BLOB.  You can destroy its integrity by
feeding it various blobs which happen to hash to the same values.  He
seems to think that the only possible attack is that someone would go
in and modify the database by hand -- rather than feeding it new input
that confuses it.
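
A toy illustration of that assumption, not git itself.  The store
below treats "same ID" as "same blob" and keeps whichever content
arrived first.  To make collisions cheap enough to demonstrate, the ID
is SHA1 truncated to two hex digits; that truncation, and the names
weak_id, ToyStore and find_collision, are assumptions made only for
this sketch.

```python
import hashlib
from itertools import count


def weak_id(content: bytes, hex_digits: int = 2) -> str:
    """Deliberately weak ID: SHA1 truncated so collisions are easy."""
    return hashlib.sha1(content).hexdigest()[:hex_digits]


class ToyStore:
    """Content-addressed store that assumes equal IDs mean equal blobs."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        oid = weak_id(content)
        # If the ID is already present, the new content is silently
        # dropped: the store believes it already has this blob.
        self.objects.setdefault(oid, content)
        return oid

    def get(self, oid: str) -> bytes:
        return self.objects[oid]


def find_collision() -> tuple[bytes, bytes]:
    """Brute-force two different inputs with the same weak ID."""
    seen: dict[str, bytes] = {}
    for i in count():
        candidate = f"file version {i}\n".encode("ascii")
        oid = weak_id(candidate)
        if oid in seen:
            return seen[oid], candidate
        seen[oid] = candidate
    raise AssertionError("unreachable")


if __name__ == "__main__":
    first, second = find_collision()
    store = ToyStore()
    store.put(first)
    oid = store.put(second)  # same ID, different content
    print("stored under", oid, "->", store.get(oid))
    print("second input was kept:", store.get(oid) == second)
```

Nobody modified the database by hand here; the confusion comes purely
from ordinary input that happens to share an ID with something already
stored.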

	John

PS (added 25 Feb 2017):

If you assume NSA is six months or a year ahead of the open
academic/industrial sector in attacking SHA1, what would they have
already subverted using a similar attack?  Hmm, check the "cmp" and
"diff" sources!  If you don't trust the SHA1 hashes that say two trees
are the same, the second step is comparing the trees of files
directly.  Making an input pattern that causes cmp and diff to always
say, "yup, no differences here!" would allow any fraudulently inserted
modifications to spread much further.