Reputation: 9378
I'm trying to understand how well FPGAs can do SHA1 hashing.
For reference, SHA1 involves doing a series of 32-bit integer computations, arranged in 80 "steps"; here are 4 representative steps from the middle of the algorithm, in C:
x0 = rol(x13 ^ x8 ^ x2 ^ x0, 1);
e += rol(a,5) + (b^c^d) + x0 + 0x6ED9EBA1L;
b = rol(b,30);
x1 = rol(x14 ^ x9 ^ x3 ^ x1, 1);
d += rol(e,5) + (a^b^c) + x1 + 0x6ED9EBA1L;
a = rol(a,30);
x2 = rol(x15 ^ x10 ^ x4 ^ x2, 1);
c += rol(d,5) + (e^a^b) + x2 + 0x6ED9EBA1L;
e = rol(e,30);
x3 = rol(x0 ^ x11 ^ x5 ^ x3, 1);
b += rol(c,5) + (d^e^a) + x3 + 0x6ED9EBA1L;
d = rol(d,30);
There are 21 internal 32-bit variables in total (the 16-word message buffer x0..x15 plus the five state words a..e), and the algorithm keeps feeding them into each other. 'rol' is rotate-left: bits shifted out of the high end come back in at the low end.
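For concreteness, the full compression function these steps come from looks something like this in plain C (a compact reference version, not the unrolled form I'm asking about):

```c
#include <stdint.h>

static uint32_t rol(uint32_t v, int n) {
    return (v << n) | (v >> (32 - n));
}

/* One SHA-1 compression of a single 512-bit block.
   h[5] is the chaining state, w[16] the block as big-endian 32-bit words. */
static void sha1_compress(uint32_t h[5], const uint32_t w[16]) {
    uint32_t x[16], a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
    for (int j = 0; j < 16; j++) x[j] = w[j];
    for (int t = 0; t < 80; t++) {
        uint32_t f, k;
        if (t < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999; }
        else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1; }
        else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDC; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6; }
        int i = t & 15;  /* 16-word circular buffer, same as x0..x15 above */
        if (t >= 16)
            x[i] = rol(x[(i + 13) & 15] ^ x[(i + 8) & 15] ^ x[(i + 2) & 15] ^ x[i], 1);
        uint32_t tmp = rol(a, 5) + f + e + k + x[i];
        e = d; d = c; c = rol(b, 30); b = a; a = tmp;
    }
    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}
```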
Now, it would seem to me that computing a 4-input XOR like x13 ^ x8 ^ x2 ^ x0 takes 32 LUTs, the 3-input XOR b^c^d takes another 32 LUTs, and I'm not clear on how to calculate the resources needed by the additions, but I'm guessing either 96 or 128 LUTs. Rotations and assignments cost nothing, since they are done through interconnects. So, let's say 192 LUTs per step, times 80 steps, plus some overhead: fully unrolled, I'd expect ~16,000 LUTs, with a throughput of 1 input block per clock cycle and a latency of 80-ish clock cycles. A Xilinx Artix-7 XC7A50T contains 8,150 slices of 4 LUTs each (32,600 LUTs), so I could fit two pipelines for a throughput of 2 blocks per clock cycle, or 600 Mhash/s at 300 MHz (just over 300 Gbps, since each block is 512 bits). Is that a reasonable estimate, or am I way off?
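Spelling that arithmetic out (the 128 LUTs for the two adders and the 300 MHz clock are my guesses, not measured figures):

```c
/* Back-of-the-envelope for the fully unrolled pipeline. */
static long luts_per_pipeline(void) {
    long per_step = 32 + 32 + 128;   /* message XOR + round XOR + adds (upper guess) */
    return per_step * 80;            /* 15360, i.e. ~16k LUTs */
}

static long pipelines_on_xc7a50t(void) {
    long chip_luts = 8150L * 4;      /* 8150 slices x 4 LUTs = 32600 */
    return chip_luts / luts_per_pipeline();  /* 2 */
}

static double gbits_per_second(void) {
    /* assumed 300 MHz clock, one 512-bit block per pipeline per cycle */
    return pipelines_on_xc7a50t() * 300e6 * 512 / 1e9;  /* ~307 Gbit/s */
}
```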
I've not been able to find any references to fully unrolled SHA1 implementations, but these guys (https://www.heliontech.com/fast_hash.htm) claim a "very high performance" implementation with 828 LUTs and a throughput of 1 block per 82 clock cycles - so, closer to 70 Gbps on an XC7A50T. Is this figure so much lower simply because they are not unrolled?
Upvotes: 2
Views: 1310
Reputation: 9378
Okay, I can answer my own question now. I've managed to put together a working SHA1 implementation in Verilog.
https://github.com/ekuznetsov139/fpga
This is actually a WPA2 PMK generator rather than just SHA1 (SHA1 executed in a loop 8192 times on the same data.)
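The 8192 falls out of the WPA2 key derivation: PMK = PBKDF2(HMAC-SHA1, passphrase, SSID, 4096 iterations), and with the ipad/opad hash states precomputed, each HMAC-SHA1 iteration costs two SHA1 compressions. A sketch of the counting (not of the actual Verilog):

```c
/* SHA1 compression count for WPA2 PMK derivation. */
static int compressions_per_pbkdf2_block(void) {
    int pbkdf2_iterations = 4096; /* fixed by the 802.11i spec */
    int sha1_per_hmac     = 2;    /* inner + outer compression, pads precomputed */
    return pbkdf2_iterations * sha1_per_hmac;  /* 8192, the loop count above */
}

static int compressions_per_pmk(void) {
    /* the 256-bit PMK needs two 160-bit PBKDF2 output blocks */
    return 2 * compressions_per_pbkdf2_block();
}
```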
I would not claim it to be perfectly optimized or even particularly well coded - I've learned all I know about Verilog in the last two weeks, in between other projects, and half of that time was spent getting the data marshalled to and from multiple instances of the core over PCI Express. But I got it working correctly in a simulator, I've had successful runs on an actual FPGA, and the performance figures are close to my original projections. With a Cyclone V target, I consistently see about 7,000 ALMs per core, with each core capable of doing one hash per clock cycle. One ALM is essentially 2 LUTs (either 1 large or 2 small) plus some carry-adder hardware, so roughly 14,000 LUTs. Fmax seems to be around 300 MHz for fast silicon and closer to 150 MHz for slow silicon.
One thing I did not account for in my initial estimates is the amount of memory needed for the internal state: 21 32-bit variables times 80 steps is 53,760 bits, and with 4 registers per ALM, that alone would require more resources than all of the computation. But the compiler is able to pack most of it into memory cells, even without explicit direction.
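In numbers (using the 4-registers-per-ALM figure from above):

```c
/* Pipeline state that has to live somewhere between stages. */
static int pipeline_state_bits(void) {
    return 21 * 32 * 80;  /* 21 variables x 32 bits x 80 steps = 53760 bits */
}

static int alms_if_all_registers(void) {
    /* 4 registers per ALM -> 13440 ALMs, i.e. more than the ~7000 per core */
    return pipeline_state_bits() / 4;
}
```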
Routing/layout is a fairly big problem, though. I have a chip with 113K ALM (301K LE). The most I've been able to fit into it is 5 copies. That's less than 40% utilization. And that took ~8 hours of fitting. Going to try to mess with LogicLock to see if I can do better.
With 5 copies running at 300 MHz, throughput would be 1.5 Ghash/s of SHA1, or 90 Khash/s of WPA2. That's somewhat less than I hoped for (about 1/3rd of the throughput of a GeForce 980 Ti), but at least the energy efficiency is a lot better.
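The WPA2 figure follows from the SHA1 rate and the two PBKDF2 blocks of 8192 compressions each per PMK:

```c
/* From SHA1 rate to PMK rate, assuming 5 cores at 300 MHz, 1 hash/cycle. */
static double sha1_per_second(void) {
    return 5 * 300e6;  /* 1.5 Ghash/s */
}

static double pmks_per_second(void) {
    /* 2 x 8192 = 16384 compressions per PMK -> ~91.5 K, i.e. the ~90 Khash/s above */
    return sha1_per_second() / 16384;
}
```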
EDIT: One look at the Design Partition Planner in the standard edition of Quartus revealed the problem. The compiler, too smart for its own good, was merging internal storage arrays of each core, thus creating tons of unnecessary interconnects between cores.
Even without full LogicLock, just with "Allow shift register merging across hierarchies" set to "off", I have a successful fit with 10 copies. Let's see if I can do 12...
Upvotes: 3
Reputation:
"Now, it would seem to me that computing a 4-input XOR like x13 ^ x8 ^ x2 ^ x0 takes 32 LUTs, the 3-input XOR b^c^d takes another 32 LUTs, and I'm not clear on how to calculate the resources needed by the additions, but I'm guessing either 96 or 128 LUTs."
That would all be true if the XORs and addition were all independent -- but that isn't the case. Each LUT on a 7-series FPGA can take up to 6 inputs, so the synthesizer may be able to absorb some of the XORs into the addition chain.
That all being said, routing and layout will be your largest obstacle. To make use of the carry chain, all of the bits in a wide adder have to be laid out "vertically". This causes the pipeline to naturally flow from left to right, but I doubt the XC7A50T has enough columns to fit the entire pipeline in a single row. Routing resources will be the limiting factor, not LUTs.
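To see why the absorption works, consider a bit-level model of a carry-chain add: the LUT ahead of each carry mux computes the propagate signal, and with 6-input LUTs there is room to fold the 3-input round-function XOR into that same LUT. This is a behavioral sketch of the idea, not an exact model of the CARRY4 primitive:

```c
#include <stdint.h>

/* Computes (b ^ c ^ d) + s the way a 7-series slice could: per bit, one LUT
   forms p = (b^c^d) ^ s (4 inputs, fits a LUT6); the carry chain does the rest. */
uint32_t add_with_folded_xor(uint32_t b, uint32_t c, uint32_t d, uint32_t s) {
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t g  = (b >> i ^ c >> i ^ d >> i) & 1; /* operand bit from the XOR */
        uint32_t si = (s >> i) & 1;
        uint32_t p  = g ^ si;             /* propagate: one LUT per bit */
        sum |= (p ^ carry) << i;          /* sum bit from the carry-chain XOR */
        carry = p ? carry : g;            /* carry mux: propagate or generate */
    }
    return sum;                           /* equals (b^c^d) + s mod 2^32 */
}
```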
Upvotes: 4