dirtbirb

Reputation: 81

Can I run a CUDA device function without parallelization or calling it as part of a kernel?

I have a program that loads an image onto a CUDA device, analyzes it with cufft and some custom stuff, and updates a single number on the device which the host then queries as needed. The analysis is mostly parallelized, but the last step sums everything up (using thrust::reduce) for a couple final calculations that aren't parallel.

Once everything is reduced, there's nothing to parallelize, but I can't figure out how to just run a device function without calling it as its own tiny kernel with <<<1, 1>>>. That seems like a hack. Is there a better way to do this? Maybe a way to tell the parallelized kernel "just do these last lines once after the parallel part is finished"?

I feel like this must have been asked before, but I can't find it. Might just not know what to search for though.

Code snip below, I hope I didn't remove anything relevant:

float *d_phs_deltas;        // Allocated using cudaMalloc (data is on device)
__device__ float d_Z;   

static __global__ void getDists(const cufftComplex* data, const bool* valid, float* phs_deltas)
{
    const int i = blockIdx.x*blockDim.x + threadIdx.x;

    // Do stuff with the line indicated by index i
    // ...

    // Save result into array, gets reduced to single number in setDist
    phs_deltas[i] = phs_delta;
}

static __global__ void setDist(const cufftComplex* data, const bool* valid, const float* phs_deltas)
{
    // Final step; does it need to be its own kernel if it only runs once??
    d_Z += phs2dst * thrust::reduce(thrust::device, phs_deltas, phs_deltas + d_y);

    // Save some other stuff to refer to next frame
    // ...
}

void fftExec(unsigned __int32 *host_data)
{
    // Copy image to device, do FFT, etc
    // ...

    // Last parallel analysis step, sets d_phs_deltas
    getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);

    // Should this be a serial part at the end of getDists somehow?
    setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);
}

// d_Z is copied out only on request
void getZ(float *Z) { cudaMemcpyFromSymbol(Z, d_Z, sizeof(float)); }

Thank you!

Upvotes: 0

Views: 426

Answers (1)

talonmies

Reputation: 72372

There is no way to run a device function directly without launching a kernel. As pointed out in the comments, there is a worked example in the Programming Guide which shows how to use memory fence functions and an atomically incremented counter so that a block can detect that it is the last one to finish:

__device__ unsigned int count = 0; 

__global__ void sum(const float* array, unsigned int N, volatile float* result) 
{
    __shared__ bool isLastBlockDone; 

    float partialSum = calculatePartialSum(array, N); 

    if (threadIdx.x == 0) {     
        result[blockIdx.x] = partialSum; 

        // Thread 0 makes sure that the incrementation 
        // of the "count" variable is only performed after 
        // the partial sum has been written to global memory. 
        __threadfence(); 

        // Thread 0 signals that it is done. 
        unsigned int value = atomicInc(&count, gridDim.x); 

        // Thread 0 determines if its block is the last 
        // block to be done. 
        isLastBlockDone = (value == (gridDim.x - 1)); 
    }

    // Synchronize to make sure that each thread reads 
    // the correct value of isLastBlockDone. 
    __syncthreads(); 

    if (isLastBlockDone) { 
        // The last block sums the partial sums 
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) { 
            // Thread 0 of last block stores the total sum 
            // to global memory and resets the count 
            // variable, so that the next kernel call
            // works properly. 
            result[0] = totalSum; 
            count = 0; 
        } 
    } 
}
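Applied to your kernels, a rough sketch of the same pattern might look like the following. It folds the final step into the parallel kernel so setDist is no longer needed; the per-line computation is elided (as in your snippet), the line count and phs2dst are passed in as placeholder parameters since they aren't shown in the question, the plain loop stands in for the thrust::reduce call, and the existing __device__ float d_Z is reused:

__device__ unsigned int blocksDone = 0;   // counts blocks that have finished the parallel part

static __global__ void getDistsFused(const cufftComplex* data, const bool* valid,
                                     float* phs_deltas, int n_lines, float phs2dst)
{
    const int i = blockIdx.x*blockDim.x + threadIdx.x;

    // Parallel part: per-line work, as in getDists (computation elided here)
    if (i < n_lines) {
        float phs_delta = 0.0f;            // ... computed from data[i], valid[i], etc.
        phs_deltas[i] = phs_delta;
    }

    // Make this thread's write visible device-wide, then wait for the whole block
    __threadfence();
    __syncthreads();

    // Thread 0 of each block bumps the counter; the block that sees the last
    // value knows every other block has already written its results
    __shared__ bool isLastBlockDone;
    if (threadIdx.x == 0) {
        unsigned int done = atomicInc(&blocksDone, gridDim.x);
        isLastBlockDone = (done == gridDim.x - 1);
    }
    __syncthreads();

    // Serial part: only thread 0 of the last block runs the final step
    if (isLastBlockDone && threadIdx.x == 0) {
        float sum = 0.0f;
        for (int j = 0; j < n_lines; ++j)
            sum += phs_deltas[j];
        d_Z += phs2dst * sum;              // d_Z as declared in the question
        blocksDone = 0;                    // reset the counter for the next frame
    }
}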

I would recommend benchmarking both approaches and choosing whichever is faster. On most platforms kernel launch latency is only a few microseconds, so a short-running kernel that finishes an action after a long-running kernel can be the most efficient way to get this done.
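If you want to measure it, a minimal timing sketch using CUDA events, placed inside a host function and reusing the launches from the question's fftExec (error checking omitted), would be something like:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Time the two-kernel version; the fused single-kernel variant
// would be timed the same way for comparison
cudaEventRecord(start);
getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);
setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("getDists + setDist: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);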

Upvotes: 1
