SilverEnsign99
SilverEnsign99

Reputation: 23

Fill device array consecutively in CUDA

(This might be more of a theoretical parallel optimization problem then a CUDA specific problem per se. I'm very new to Parallel Programming in general so this may just be personal ignorance.)

I have a workload that consists of a 64-bit binary numbers upon which I run analysis. If the analysis completes successfully then that binary number is a "valid solution". If the analysis breaks midway then the number is "invalid". The end goal is to get a list of all the valid solutions.

Now there are many trillions of 64 bit binary numbers I am analyzing, but only ~5% or less will be valid solutions, and they usually come in bunches (i.e. every consecutive 1000 numbers are valid and then every random billion or so are invalid). I can't find a pattern to the space between bunches so I can't ignore the large chunks of invalid solutions.

Currently, every thread in a kernel call analyzes just one number. If the number is valid it denotes it as such in it's respective place on a device array. Ditto if it's invalid. So basically I generate a data point for very value analyzed regardless if it's valid or not. Then once the array is full I copy it to host only if a valid solution was found (denoted by a flag on the device). With this, overall throughput is greatest when the array is the same size as the # of threads in the grid.

But Copying Memory to & from the GPU is expensive time wise. That said what I would like to do is copy data over only when necessary; I want to fill up a device array with only valid solutions and then once the array is full then copy it over from the host. But how do you consecutively fill an array up in a parallel environment? Or am I approaching this problem the wrong way?

EDIT 1

This is the Kernel I initially developed. As you see I am generating 1 byte of data for each value analyzed. Now I really only need each 64 bit number which is valid; if I need be I can make a new kernel. As suggested by some of the commentators I am currently looking into stream compaction.

__global__ void kValid(unsigned long long*kInfo, unsigned char*values, char *solutionFound) {
    //a 64 bit binary value to be evaluated is called a kValue
    unsigned long long int kStart, kEnd, kRoot, kSize, curK;
    //kRoot is the kValue at the start of device array, this is used is the device array is larger than the total threads in the grid
    //kStart is the kValue to start this kernel call on
    //kEnd is the last kValue to validate
    //kSize is how many bits long is kValue (we don't necessarily use all 64 bits but this value stays constant over the entire chunk of values defined on the host
    //curK is the current kValue represented as a 64 bit unsigned integer

    int rowCount, kBitLocation, kMirrorBitLocation, row, col, nodes, edges; 

    kStart = kInfo[0];
    kEnd = kInfo[1];
    kRoot = kInfo[2];
    nodes = kInfo[3];
    edges = kInfo[4];
    kSize = kInfo[5];

    curK = blockIdx.x*blockDim.x + threadIdx.x + kStart;
    if (curK > kEnd) {//check to make sure you don't overshoot the end value
        return;
    }

    kBitLocation = 1;//assuming the first bit in the kvalue has a position 1;
    for (row = 0; row < nodes; row++) {
        rowCount = 0;
        kMirrorBitLocation = row;//the bit position for the mirrored kvals is always starts at the row value (assuming the first row has a position of 0)
        for (col = 0; col < nodes; col++) {
            if (col > row) {
                if (curK & (1 << (unsigned long long int)(kSize - kBitLocation))) {//add one to kIterator to convert to counting space
                    rowCount++;
                }
                kBitLocation++;
            }
            if (col < row) {
                if (col > 0) {
                    kMirrorBitLocation += (nodes - 2) - (col - 1);
                }
                if (curK & (1 << (unsigned long long int)(kSize - kMirrorBitLocation))) {//if bit is set
                    rowCount++;
                }
            }
        }
        if (rowCount != edges) {
            //set the ith bit to zero
            values[curK - kRoot] = 0;
            return;
        }
    }
    //set the ith bit to one
    values[curK - kRoot] = 1;
    *solutionFound = 1; //not a race condition b/c it will only ever be set to 1 by any thread.
}

Upvotes: 0

Views: 375

Answers (2)

Milhous
Milhous

Reputation: 14653

So, you would want to have each thread analyze multiple numbers (thousands or millions) before you do a return from the computation. So if you analyze a million numbers in your thread, you will only need %5 of that amount of space to possible hold the results of that computation.

Upvotes: 1

einpoklum
einpoklum

Reputation: 131978

(This answer assumes output order is inconsequential and so are the positions of the valid values.)

Conceptually, your analysis produces a set of valid values. The implementation you described uses a dense representation of this set: One bit for every potential value. Yet you've indicated that the data is quite sparse (either 5e-2 or 1000/10^9 = 1e-6); moreover, copying data across PCI express is quite a pain.

Well, then, why not consider a sparse representation? The simplest one would be merely an unordered sequence of the valid values. Of course, writing that requires some synchronization across threads - perhaps even across blocks. Roughly, you can have warps collect their valid values in shared memory; then synchronize at the block level to collect the block's valid values (for a given chunk of the input it has analyzed); and finally use atomics to collect the data from all the blocks.

Oh, also - have each thread analyze multiple values, so you don't have to do that much synchronization.

Upvotes: 2

Related Questions