Reputation: 25
From cuda cpu function - gpu kernel overlap, I know how to execute GPU and CPU functions concurrently. But here is another situation: the GPU and CPU functions have to execute serially. The question is, when the CPU is blocked waiting for the GPU kernel to finish, does the CPU thread suspend? If so, the CPU utilization should be low, right?
Below is my CUDA code; it's quite simple, just for testing:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

__global__ void kernel(float *d_data)
{
    // dead loop
    while (1)
    {
        *d_data = -1;
        *d_data = 1 / (*d_data);
        *d_data = (*d_data) / (*d_data);
    }
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, sizeof(float));
    kernel<<<1, 1>>>(d_data);
    // the CPU thread blocks here
    float data;
    cudaMemcpy(&data, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", data);
    return 0;
}
Checking with top
, the CPU occupancy is 100%:
%Cpu10 : 75.1 us, 24.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
and I have confirmed that the process I launched is running on Cpu10.
Am I missing something? I am very grateful for your help!
Upvotes: 1
Views: 113
Reputation: 151944
The CPU process (thread, actually) doesn't suspend. The cudaMemcpy
operation after the kernel call is issued to the same CUDA stream (the default stream), so it blocks the CPU thread and waits for the kernel to complete.
The blocking inside the cudaMemcpy
call is (by default) a CPU spin wait, not a thread yield. That is why top shows the core at 100%.
In theory, you can modify the CUDA device synchronization behavior with cudaSetDeviceFlags. However, you will need to experiment with the flags to see whether any option gives you anything preferable to the default behavior.
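As a rough sketch (not your exact code), the flag would be set once, before the CUDA context is created, like this:

```cuda
#include <cuda_runtime.h>

int main()
{
    // Must be called before any CUDA runtime call that creates the context.
    // cudaDeviceScheduleBlockingSync asks the runtime to block the CPU thread
    // on a synchronization primitive while waiting for the device, instead of
    // the default spin wait. Whether this lowers CPU usage in practice
    // depends on the driver and platform, so measure it yourself.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    // ... allocate, launch kernels, cudaMemcpy, etc., as before ...
    return 0;
}
```

Other possible values are cudaDeviceScheduleYield (yield the thread between polls) and cudaDeviceScheduleSpin (the explicit spin wait).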
If you have an idea of how long the kernel will execute, you can also use a function like sleep()
in host code to yield the thread for a period of time, then use a mechanism like cudaEventQuery
to determine whether to proceed or continue waiting.
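A minimal sketch of that polling pattern (assuming a kernel that actually terminates, unlike the dead-loop test kernel above, and using usleep() from unistd.h for the sleep):

```cuda
#include <cuda_runtime.h>
#include <unistd.h>
#include <stdio.h>

__global__ void kernel(float *d_data)
{
    *d_data = 1.0f;  // placeholder for real work
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, sizeof(float));

    cudaEvent_t done;
    cudaEventCreate(&done);

    kernel<<<1, 1>>>(d_data);
    cudaEventRecord(done);  // recorded in the same (default) stream, after the kernel

    // Poll the event, yielding the CPU between checks instead of spin-waiting.
    // cudaEventQuery returns cudaErrorNotReady until all preceding work
    // in the stream has completed.
    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(1000);  // sleep ~1 ms; CPU usage stays near zero

    float data;
    cudaMemcpy(&data, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", data);

    cudaEventDestroy(done);
    cudaFree(d_data);
    return 0;
}
```

The sleep interval is a trade-off: longer sleeps lower CPU usage but add up to that much latency before the host notices the kernel has finished.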
Upvotes: 2