WorkinChina

Reputation: 25

Why can't I use the GPU to reduce CPU utilization?

From "cuda cpu function - gpu kernel overlap", I know how to execute GPU and CPU functions concurrently. But here is another situation: the GPU and CPU functions have to execute serially. My question is, when the CPU is blocked waiting for a running GPU kernel, is the CPU process suspended? If so, CPU utilization should be low, right?

Below is my CUDA code. It is quite simple and only meant as a test:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

__global__ void kernel(float *d_data)
{
    // infinite loop: the kernel never finishes
    while(1)
    {
        *d_data = -1;
        *d_data = 1/(*d_data);
        *d_data = (*d_data) / (*d_data);
    }
}


int main()
{
    float *d_data;
    cudaMalloc(&d_data, sizeof(float));
    kernel<<<1, 1>>>(d_data);
    // the CPU thread blocks in the cudaMemcpy below until the kernel completes
    float data;
    cudaMemcpy(&data, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n",data);
    return 0;
}

Checking with top shows that CPU utilization is 100%:

%Cpu10 : 75.1 us, 24.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

and I have confirmed that the cpu process I launch is running on Cpu10.

Am I missing something? I am very grateful for your help!

Upvotes: 1

Views: 113

Answers (1)

Robert Crovella

Reputation: 151944

The CPU process (thread, actually) doesn't suspend.

The cudaMemcpy operation after the kernel call is issued to the same CUDA stream (the default stream), so it blocks the CPU thread and waits for the kernel to complete.

The block inside the cudaMemcpy call is (by default) a CPU spin wait, not a thread yield.

In theory, you can modify the CUDA device synchronization behavior by setting device scheduling flags (for example with cudaSetDeviceFlags). However, you will need to experiment with the flags to see whether any option gives you something preferable to the default behavior.
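As a rough sketch of what such an experiment might look like, the program below requests cudaDeviceScheduleBlockingSync through cudaSetDeviceFlags before doing any other CUDA work. The kernel name busy_kernel, the helper check, and the cycle count are illustrative assumptions, not part of the question's code; the kernel is a finite stand-in so that the program terminates. Whether this flag actually lowers CPU usage for a blocking cudaMemcpy on a given driver and GPU is something to measure with top, not something this sketch guarantees.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical finite stand-in for the long-running kernel in the question:
// spins on the device clock for roughly `cycles` clock ticks, then writes 1.
__global__ void busy_kernel(float *d_data, long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
    *d_data = 1.0f;
}

// Hypothetical helper: reports a failed CUDA call and returns false.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    // Ask the runtime to block the host thread on a synchronization primitive
    // instead of spinning. Setting the flag before any other CUDA call is the
    // conservative ordering (assumption: no context exists yet).
    if (!check(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync),
               "cudaSetDeviceFlags")) return 1;

    float *d_data = nullptr;
    if (!check(cudaMalloc(&d_data, sizeof(float)), "cudaMalloc")) return 1;

    // Illustrative duration: about a second or two on many GPUs.
    busy_kernel<<<1, 1>>>(d_data, 2000000000LL);
    if (!check(cudaGetLastError(), "kernel launch")) return 1;

    // Same-stream copy: waits for the kernel, using the wait policy set above.
    float data = 0.0f;
    if (!check(cudaMemcpy(&data, d_data, sizeof(float), cudaMemcpyDeviceToHost),
               "cudaMemcpy")) return 1;

    printf("%f\n", data);
    cudaFree(d_data);
    return 0;
}
```

Swapping the flag for cudaDeviceScheduleYield or cudaDeviceScheduleSpin and comparing the top output for each run is one way to carry out the experiment.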

If you have an idea how long the kernel will execute, you can also use a function like sleep() in CPU code to yield the thread for a specific period of time, then use a mechanism like cudaEventQuery to determine whether to proceed or continue waiting.
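The sleep-and-poll idea could be sketched as follows. An event is recorded right after the kernel launch, and the host thread sleeps between calls to cudaEventQuery until the event reports completion. The names busy_kernel and wait_by_polling, the 10 ms polling interval, and the cycle count are assumptions chosen for illustration, and std::this_thread::sleep_for stands in for sleep().

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical finite stand-in for the long-running kernel in the question.
__global__ void busy_kernel(float *d_data, long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
    *d_data = 1.0f;
}

// Hypothetical helper: sleeps between polls so the CPU thread is descheduled
// while the GPU works. Returns cudaSuccess once the event has completed, or
// the first error code other than cudaErrorNotReady.
static cudaError_t wait_by_polling(cudaEvent_t done,
                                   std::chrono::milliseconds interval)
{
    cudaError_t status;
    while ((status = cudaEventQuery(done)) == cudaErrorNotReady) {
        std::this_thread::sleep_for(interval);
    }
    return status;
}

int main()
{
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, sizeof(float)) != cudaSuccess) return 1;

    // Timing is not needed here, so disable it on the event.
    cudaEvent_t done;
    if (cudaEventCreateWithFlags(&done, cudaEventDisableTiming) != cudaSuccess)
        return 1;

    busy_kernel<<<1, 1>>>(d_data, 2000000000LL);
    cudaEventRecord(done, 0);  // completes after the kernel in the default stream

    // Poll every 10 ms (illustrative value) instead of blocking in cudaMemcpy.
    cudaError_t status = wait_by_polling(done, std::chrono::milliseconds(10));
    if (status != cudaSuccess) {
        fprintf(stderr, "wait failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    // The kernel has finished, so this copy should return promptly.
    float data = 0.0f;
    cudaMemcpy(&data, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", data);

    cudaEventDestroy(done);
    cudaFree(d_data);
    return 0;
}
```

The polling interval trades latency for CPU time: a longer sleep keeps the core idle more of the time, but the host notices kernel completion later.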

Upvotes: 2
