<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Sun, May 1, 2016 at 8:00 AM, Henry Baker <span dir="ltr">&lt;<a href="mailto:hbaker1@pipeline.com" target="_blank">hbaker1@pipeline.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">sha1sum took 24 seconds.<br>

sha3sum (default algorithm) took 54 seconds.<br>

sha256sum took 54 seconds.<br>

b2sum-i686-linux took 35.7 seconds.<br>

b2sum-amd64-linux took 27.3 seconds.<br></blockquote><div><br></div><div>This shows a major problem we face in Linux distros: we like everyone to run the same binary, so everyone is forced to use the oldest supported CPU instruction set.<br></div><div><br></div><div>The program sha1sum is from the coreutils package, which AFAICT contains no vector optimization of any kind.  Here's what I get with my version of b2sum, compiled with AVX2 support, vs. the sha1sum that ships with Ubuntu.  randfile is a 300 MiB file, already cached:</div><div><br></div><div><div>$ time sha1sum randfile </div><div>30b42c8894b108d65db90090c98c0a9c8cd63cb9  randfile</div><div><br></div><div>real<span class="" style="white-space:pre">        </span>0m0.845s</div><div>user<span class="" style="white-space:pre">       </span>0m0.784s</div><div>sys<span class="" style="white-space:pre">        </span>0m0.056s</div></div><div><br></div><div><div>$ time b2sum randfile</div><div>e2cb7410dcbe11930909f144da7c2121f22100d7825614d640fa63e14a2da01265da779030a250e718ed30250221157992567d7cee4c4b4a28f77bcbbe4df514  randfile</div><div><br></div><div>real<span class="" style="white-space:pre">        </span>0m0.432s</div><div>user<span class="" style="white-space:pre">       </span>0m0.396s</div><div>sys<span class="" style="white-space:pre">        </span>0m0.036s</div></div><div><br></div><div>BLAKE2 is almost twice as fast, and the parallel version is faster still in wall-clock time (for large inputs, not for inputs under 1 KiB):</div><div><br></div><div><div>$ time b2sum -a blake2bp randfile </div><div>4d33a9488a3a197a7179350b7c000296c231129679bc11ab024b11fda1f583cb957980e4c8e8cd6fb751ad3406842e54e7246675118d857342dbc8a60e4a84f2  randfile</div><div><br></div><div>real<span class="" style="white-space:pre">    </span>0m0.324s</div><div>user<span class="" style="white-space:pre">       </span>0m0.996s</div><div>sys<span class="" style="white-space:pre">        </span>0m0.068s</div></div><div><br></div><div><br></div><div>Not only that, but Samuel 
Neves (who wrote the optimized BLAKE2 code) has an optimized version of BLAKE2bp that uses more of the available parallelism per core to reach a throughput of around 1 byte/cycle.</div><div><br></div><div>Now, any sane person who needs a whole lot of speed and has a processor supporting SSE4 or a newer instruction set should just use BLAKE2.  That said, the future looks bright for even more speed using HighwayHash-like parallel multiplications and byte shuffles.  We're seeing some pretty sick speed in prototype code.</div><div><br></div><div>Bill</div></div></div></div>