solvingPuzzles
solvingPuzzles

Reputation: 8889

Equivalent of usleep() in CUDA kernel?

I'd like to call something like usleep() inside a CUDA kernel. The basic goal is to make all GPU cores sleep or busywait for a number of millesconds--it's part of some sanity checks that I want to do for a CUDA application. My attempt at doing this is below:

#include <unistd.h>
#include <stdio.h>
#include <cuda.h>
#include <sys/time.h>

__global__ void gpu_uSleep(useconds_t wait_time_in_ms)
{
    usleep(wait_time_in_ms);
}

int main(void)
{
    //input parameters -- arbitrary
    //   TODO: set these exactly for full occupancy
    int m = 16;
    int n = 16;
    int block1D = 16;
    dim3 block(block1D, block1D);
    dim3 grid(m/block1D, n/block1D);

    useconds_t wait_time_in_ms = 1000;

    //execute the kernel
    gpu_uSleep<<< grid, block >>>(wait_time_in_ms);
    cudaDeviceSynchronize();

    return 0;
}

I get the following error when I try to compile this using NVCC:

error: calling a host function("usleep") from a __device__/__global__ 
       function("gpu_uSleep") is not allowed

Clearly, I'm not allowed to use a host function such as usleep() inside a kernel. What would be a good alternative to this?

Upvotes: 15

Views: 15882

Answers (4)

redolent
redolent

Reputation: 4259

The best way to "sleep the cores" is to have the kernel return, to the CPU, and then have the CPU just launch a second kernel (or the same kernel again). This prevents the GPUs from having to spin & overheat.

Upvotes: 1

einpoklum
einpoklum

Reputation: 131544

With recent versions of CUDA, and a device with Compute Capability 7.0 or later (Volta, Turing, Ampere etc.), you can use the __nanosleep() primitive:

void __nanosleep(unsigned ns);

which obviates the need for busy-sleeping as suggested in older answers.

Upvotes: 10

Greg Smith
Greg Smith

Reputation: 11529

You can spin on clock() or clock64(). The CUDA SDK concurrentKernels sample does this does the following:

__global__ void clock_block(clock_t *d_o, clock_t clock_count)
{
    clock_t start_clock = clock();
    clock_t clock_offset = 0;
    while (clock_offset < clock_count)
    {
        clock_offset = clock() - start_clock;
    }
     d_o[0] = clock_offset;
}

I recommend using clock64(). clock() and clock64() are in cycles so you will have to query the frequency using cudaDeviceProperties(). The frequency can be dynamic so it will be hard to guarantee an accurate spin loop.

Upvotes: 26

Roger Dahl
Roger Dahl

Reputation: 15734

You can busy wait with a loop that reads clock().

To wait for at least 10,000 clock cycles:

clock_t start = clock();
clock_t now;
for (;;) {
  now = clock();
  clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
  if (cycles >= 10000) {
    break;
  }
}
// Stored "now" in global memory here to prevent the
// compiler from optimizing away the entire loop.
*global_now = now;

Note: This is untested. The code that handles overflows was borrowed from this answer by @Pedro. See his answer and section B.10 in the CUDA C Programming Guide 4.2 for details on how clock() works. There is also a clock64() command.

Upvotes: 11

Related Questions