absentmindeduk

Reputation: 136

OpenCL SHA1 Throughput Optimisation

Hoping someone more experienced in OpenCL usage may be able to help me here! I'm doing a project (to help me learn a bit more crypto and to try my hand at GPGPU programming) where I'm trying to implement my own SHA-1 algorithm.

Ultimately my question is about maximizing my throughput rates. At present I'm seeing something like 56.1 MH/sec, which compares very badly to the open source programs I've looked at, such as John the Ripper and OCLHashcat, which give around 1,000 and 1,500 MH/sec respectively (heck, I'd be well chuffed with a third of that!).

So, what I'm doing

I've written a SHA-1 implementation in an OpenCL kernel and a C++ host application that loads data onto the GPU (using the OpenCL 1.2 C++ wrapper). I'm generating blocks of candidate data to hash on the CPU in a threaded fashion, and loading them into global GPU memory with enqueueWriteBuffer (using uchars to represent the bytes to hash):

errorCode = dispatchQueue->enqueueWriteBuffer(
        inputBuffer,
        CL_FALSE,                            // non-blocking write
        0,                                   // offset into the device buffer
        sizeof(cl_uchar) * inputBufferSize,  // bytes to copy
        passwordBuffer,                      // host source pointer
        NULL,                                // no event wait-list
        &dispatchDelegate);                  // completion event

I'm enqueueing the kernel with enqueueNDRangeKernel in the following manner (globalWorkgroupSize is a user-defined variable; at present I've set it to my GPU's maximum flattened global work size of 16.777 million work-items per run):

errorCode = dispatchQueue->enqueueNDRangeKernel(
        *kernel,
        NullRange,                        // no global offset
        NDRange(globalWorkgroupSize, 1),  // global work size
        NullRange,                        // let the runtime pick the local size
        NULL,
        NULL);

This means that per dispatch I load 16.777 million candidates into a 1D array and index into it from my kernel using get_global_id(0).

My Kernel signature:

    __kernel void sha1Crack(__global uchar* out, __global uchar* in, 
                            __constant int* passLen, __constant int* targetHash, 
                            __global bool* collisionFound)
    {
        //Kernel Instance Global GPU Mem IO Mapping:
        __private int id = get_global_id(0);
        __private int passwordLen = passLen[0];
        __private int inputIndexStart = id * passwordLen;

        //Private scratch buffer for this work-item's candidate
        //(MAX_PASS_LEN assumed defined at build time, e.g. -D MAX_PASS_LEN=16):
        __private uchar inputMem[MAX_PASS_LEN];

        //Select password input from the key space:
        #pragma unroll
        for (int i = 0; i < passwordLen; i++)
        {
            inputMem[i] = in[inputIndexStart + i];
        }

        //SHA1 Code omitted for brevity...
    }

So, given all this: am I doing something fundamentally wrong in how I'm loading data? I.e. one call to enqueueNDRangeKernel for 16.7 million kernel executions over a 1D input vector? Should I be using a 2D space and sub-dividing into local workgroup ranges? I tried playing with this but it didn't seem any quicker.
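For reference, this is roughly what I tried when specifying an explicit local size (the 64 is just a value I experimented with, not a recommendation; CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE can be queried to pick a better one):

errorCode = dispatchQueue->enqueueNDRangeKernel(
        *kernel,
        NullRange,                     // no global offset
        NDRange(globalWorkgroupSize),  // 1D global work size
        NDRange(64),                   // explicit local size (must divide the global size)
        NULL,
        NULL);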

Or, perhaps just as likely, is my algorithm itself the source of the slowness? I've spent a good while optimizing it, manually unrolling all of the loop stages using pre-processor directives.
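To show what I mean by unrolling, each round step is expanded via macros something like this (ROTL and ROUND_00_19 are my own names, not from any library):

    //Illustrative only: one expanded SHA-1 round step for rounds 0-19.
    #define ROTL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
    #define ROUND_00_19(a, b, c, d, e, w) \
        (e) += ROTL((a), 5) + (((b) & (c)) | (~(b) & (d))) + 0x5A827999 + (w); \
        (b) = ROTL((b), 30);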

I've read about memory coalescing on the hardware. Could that be my issue? :S
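From what I can tell, my current layout means each work-item reads passwordLen consecutive bytes starting at id * passwordLen, so neighbouring work-items read addresses passwordLen bytes apart rather than adjacent ones. If coalescing is the problem, I assume a "transposed" (character-major) layout would be the fix, something like:

    //Sketch, assuming the host writes the buffer character-major
    //(all first characters, then all second characters, ...).
    //numCandidates is an assumed name for the work-items per dispatch:
    for (int i = 0; i < passwordLen; i++)
    {
        inputMem[i] = in[i * numCandidates + id]; //adjacent ids read adjacent bytes
    }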

Any advice at all appreciated! If I've missed anything important please let me know and I'll update.

Thanks in advance! ;)


Update: 16,777,216 is the maximum flattened global work size reported by the device (256**3). The "global array of boolean values" is actually a single boolean: it's set to false before the kernel is enqueued, and a branching statement sets it to true only if a collision is found. Will that branch force divergence? passwordLen is the length of the current input value, and targetHash is an int[4] encoded hash to check against.
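To be concrete, the collision check is shaped roughly like this (names assumed; hash[] holds this work-item's computed digest words):

    //Only a matching work-item ever takes the branch and writes the flag:
    bool match = true;
    for (int w = 0; w < 4; w++)
    {
        match = match && (hash[w] == targetHash[w]);
    }
    if (match)
    {
        *collisionFound = true;
    }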

Upvotes: 2

Views: 1395

Answers (1)

David Higgins

Reputation: 11

Your 'maximum flattened global worksize' should be multiplied by passwordLen: it's the number of work-items (kernel instances) you can run, not the maximum length of an input array. You can most likely send much more data than this to the GPU in one go.
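For example (my arithmetic, assuming 8-byte candidates):

size_t inputBufferSize = (size_t)16777216 * 8; // 134,217,728 bytes = 128 MiB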

Other potential issues: for the 'generating blocks of candidate data to hash in a threaded fashion on the CPU', try generating the data in advance of the kernel iterations to see whether the delay is in generating the data blocks or in processing the kernels. Your sha1 algorithm itself is the other obvious candidate. I'm not sure how much you've really gained by 'unrolling' the loops; usually the bigger optimisation issue is 'if' statements, because if a single kernel instance within a workgroup tests true, all of the lockstepped instances in that workgroup must step through that branch together.
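A cheap way to separate the two is to time the kernel on its own with OpenCL event profiling (this is a sketch; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE):

Event kernelDone;
dispatchQueue->enqueueNDRangeKernel(*kernel, NullRange,
        NDRange(globalWorkgroupSize, 1), NullRange, NULL, &kernelDone);
kernelDone.wait();
cl_ulong start = kernelDone.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong end   = kernelDone.getProfilingInfo<CL_PROFILING_COMMAND_END>();
double kernelMs = (end - start) * 1e-6; // profiling timestamps are in ns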

And DarkZeros is correct, you should manually play with the local workgroup size, making it the highest common factor of the global size and the number of kernel instances which can be run at once on the card. The easiest way to do this is to round the global work size up to the next multiple of the local size and guard the kernel body with an if{} statement, only running it for global_id less than the actual number of kernels you want to run.
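A sketch of that pattern on the host side (localSize and actualCount are illustrative names, and 64 is just an assumed starting point to tune per device):

size_t localSize = 64;
size_t paddedGlobalSize = ((actualCount + localSize - 1) / localSize) * localSize;
dispatchQueue->enqueueNDRangeKernel(*kernel, NullRange,
        NDRange(paddedGlobalSize), NDRange(localSize), NULL, NULL);
// In the kernel, with actualCount passed in as an argument, guard the body:
//     if (get_global_id(0) < actualCount) { ... }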

Dave.

Upvotes: 1
