Reputation: 811
I'm trying to create a large number of SHA-256 hashes quickly on a SPARC T4 machine. The T4 has a 'sha256' instruction which lets me calculate a hash in a single opcode. I created an inline assembly template to invoke the sha256 opcode.
In my C++ code:
extern "C"
{
void ProcessChunk(const char* buf, uint32_t* state);
}
pchunk.il:
.inline ProcessChunk,8
.volatile
/* copy state */
ldd [%o1],%f0 /* load 8 bytes */
ldd [%o1 + 8],%f2 /* load 8 bytes */
ldd [%o1 +16],%f4 /* load 8 bytes */
ldd [%o1 +24],%f6 /* load 8 bytes */
/* copy data */
ldd [%o0],%f8 /* load 8 bytes */
ldd [%o0+8],%f10 /* load 8 bytes */
ldd [%o0+16],%f12 /* load 8 bytes */
ldd [%o0+24],%f14 /* load 8 bytes */
ldd [%o0+32],%f16 /* load 8 bytes */
ldd [%o0+40],%f18 /* load 8 bytes */
ldd [%o0+48],%f20 /* load 8 bytes */
ldd [%o0+56],%f22 /* load 8 bytes */
sha256 /* one compression: state in %f0-%f6, data in %f8-%f22 */
nop
std %f0, [%o1]
std %f2, [%o1+8]
std %f4, [%o1+16]
std %f6, [%o1+24]
.end
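For reference, here is a minimal sketch of the C++ side that drives such a template (my sketch, not code from the question: the empty-message padding and the main() harness are made up for illustration; the state values are the standard SHA-256 initial constants, and with Sun Studio the template is pulled in by passing the .il file on the compile line, e.g. CC main.cpp pchunk.il):
#include <stdint.h>
#include <string.h>
#include <stdio.h>
extern "C" void ProcessChunk(const char* buf, uint32_t* state);
int main()
{
    /* Standard SHA-256 initial hash values H0..H7. */
    uint32_t state[8] = {
        0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
        0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
    };
    /* One pre-padded 64-byte block; real code must apply the SHA-256
       padding (0x80, zeros, 64-bit message length) itself. This block
       encodes the empty message, so a correct template prints the
       well-known digest beginning with e3b0c442. */
    char chunk[64];
    memset(chunk, 0, sizeof(chunk));
    chunk[0] = (char)0x80;
    ProcessChunk(chunk, state); /* one compression step per call */
    for (int i = 0; i < 8; ++i)
        printf("%08x", state[i]);
    printf("\n");
    return 0;
}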
Things work great in a single-threaded environment, but it is not fast enough. I used OpenMP to parallelize the application so that I can call ProcessChunk from multiple threads simultaneously. The multithreaded version works fine with a few threads, but when I increase the number of threads (to 16, for example) I begin to get bogus results. The inputs to ProcessChunk are both stack variables local to each thread, and I've confirmed that the inputs are generated correctly no matter how many threads are running.
If I put ProcessChunk inside a critical section I get correct results, but performance degrades significantly (a single thread performs better). I'm stumped as to what the problem might be. Is it possible for Solaris threads to step on the floating-point registers of another thread?
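For context, the parallel driver is roughly shaped like this (a sketch only; HashMany and the flat input/output layout are hypothetical, not the question's actual code). The point is that every iteration touches only thread-local buffers:
#include <stdint.h>
#include <string.h>
extern "C" void ProcessChunk(const char* buf, uint32_t* state);
/* Hash n independent 64-byte chunks; each thread works entirely on
   stack-local copies, so nothing is shared at the C++ level. */
void HashMany(const char* chunks, uint32_t* digests, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        char buf[64];
        uint32_t state[8] = {
            0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
            0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
        };
        memcpy(buf, chunks + i * 64, 64);
        ProcessChunk(buf, state);
        memcpy(digests + i * 8, state, 32);
    }
}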
Any ideas how I can debug this?
Regards
Update:
I changed the code to use quad-sized (16-byte) loads and stores:
.inline ProcessChunk,8
.volatile
/* copy state */
ldq [%o1], %f0
ldq [%o1 +16],%f4
/* copy data */
ldq [%o0], %f8
ldq [%o0+16],%f12
ldq [%o0+32],%f16
ldq [%o0+48],%f20
sha256
nop
stq %f0, [%o1]
stq %f4, [%o1+16]
.end
At first glance the issue seems to have gone away. Performance degrades significantly beyond 32 threads, so that is the number I'm sticking with (for the moment, at least), and with the current code I seem to be getting correct results. I may have just masked the issue, so I'm going to run further tests.
Update 2:
I found some time to go back to this, and I was able to get decent results from the T4 (tens of millions of hashes per minute).
I made a number of changes, packed everything up in a library, and made the code available here.
Upvotes: 4
Views: 718
Reputation: 2621
I'm not a SPARC architecture expert (I might be wrong), but here's my guess:
Your inline assembly code loads the stack variables into a specific set of floating-point registers in order to invoke the sha256 assembly operation.
How does this work for two threads? Both calls to ProcessChunk will try to copy different input values into the very same CPU registers.
The way I normally think of it, CPU registers in asm code are like "global" variables for a high-level programming language.
How many cores does your system have? Maybe you are fine until you have one thread per core/set of hardware registers. But that also implies the behavior of the code could depend on how the threads are scheduled across the cores of your system.
Do you know how the system behaves when it schedules threads from the same process on a CPU core? What I mean is: does the system save the registers of the descheduled thread, as it would in a context switch?
A test I would run is to spawn a number of threads equal to the number N of CPU cores, and then run the same test with N+1 (my assumption being that there is one floating-point register set per CPU core). A sketch of such a harness follows.
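Here is a hypothetical harness for that experiment, reusing the ProcessChunk declaration from the question: compute a single-threaded reference first, then repeat the identical work while stepping the thread count across the core-count boundary and compare the outputs.
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>
extern "C" void ProcessChunk(const char* buf, uint32_t* state);
enum { N = 100000 };
static char chunks[N * 64];
static uint32_t ref[N * 8], got[N * 8];
static void run(int nthreads, uint32_t* out)
{
    omp_set_num_threads(nthreads);
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        uint32_t state[8] = {
            0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
            0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
        };
        ProcessChunk(chunks + i * 64, state);
        memcpy(out + i * 8, state, 32);
    }
}
int main()
{
    for (int i = 0; i < N * 64; ++i)
        chunks[i] = (char)(i * 31);      /* arbitrary test input */
    run(1, ref);                         /* single-threaded reference */
    int ncores = omp_get_num_procs();
    for (int t = 2; t <= ncores + 1; ++t) { /* cross the core boundary */
        run(t, got);
        printf("%2d threads: %s\n", t,
               memcmp(ref, got, sizeof ref) ? "MISMATCH" : "ok");
    }
    return 0;
}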
Upvotes: 1