<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Sat, Sep 19, 2015 at 11:38 AM, Rob S. <span dir="ltr">&lt;<a href="mailto:rob.schneier1@gmail.com" target="_blank">rob.schneier1@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Tony Arcieri's post is full of misconceptions and mistakes.<br></blockquote><div><br></div><div>Oh really!</div><div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Most of the recent Kummer implementations (see<br>

<a href="http://eprint.iacr.org/2012/670.pdf" rel="noreferrer" target="_blank">http://eprint.iacr.org/2012/670.pdf</a> and<br>

<a href="http://eprint.iacr.org/2014/134.pdf" rel="noreferrer" target="_blank">http://eprint.iacr.org/2014/134.pdf</a>) are fully optimized in assembly.</blockquote><div><br></div><div>Cool claim, it's both vague and not true in places where it really matters.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Still, I see that FourQ is significantly faster when looking at different</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

64-bit processors (check out Table 5 in the FourQ paper,<br>

<a href="http://eprint.iacr.org/2015/565.pdf" rel="noreferrer" target="_blank">http://eprint.iacr.org/2015/565.pdf</a>). If one looks across different CPUs<br>

(not only one CPU in particular)</blockquote><div><br></div><div>Funny thing, I know you really don't want to for some reason, but if we do look at one CPU architecture, there's a rather important one that Kummer hasn't been optimized for. I'll just quote djb...</div><div><br></div><div>On Tue, Sep 15, 2015 at 8:49 AM, D. J. Bernstein <span dir="ltr">&lt;<a href="mailto:djb@cr.yp.to" target="_blank">djb@cr.yp.to</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">The critical statement is "59,000" Haswell cycles for FourQ, compared to<br>60556 Haswell cycles (reported by eBATS) for Kummer.<br><br>What's amusing about this is that Haswell is the only platform where we<br>didn't bother writing an asm implementation for Kummer---this is a very<br>simple C implementation with intrinsics. Anyone want to bet on what the<br>results of an asm implementation will be?</blockquote></div></div><div><br></div><div>I guess your point is why should we care about Haswell? Well, I certainly care about Haswell and its progeny even if you don't...</div><div><br></div><div>djb has several other criticisms of the performance metrics in the FourQ paper. Microsoft also has a history of over-embellishing the performance of e.g. the NUMS curves. Perhaps some independent verification is needed before the figures in their paper are taken at face value?</div><div><br></div><div>This is why benchmarking systems like SUPERCOP are nice.</div><div><br></div>-- <br><div class="gmail_signature">Tony Arcieri<br></div>

</div></div>