why the first cuda kernel cannot overlap with previous memcpy?

Question

Here is a demo. The kernel cannot overlap with previous cudaMemcpyAsync, although they are in different streams.

#include 
#include 

__global__ void warmUp(){
    int Id = blockIdx.x*blockDim.x+threadIdx.x;
    if(Id == 0){
        printf("warm up!");
    }
}
__global__ void kernel(){
    int Id = blockIdx.x*blockDim.x+threadIdx.x;
    if(Id == 0){
        long long x = 0;
        for(int i=0; i<1000000; i++){
            x += i>>1;
        }
        printf("kernel!%d
", x);
    }
}

int main(){
    //warmUp<<<1,32>>>();
    int *data, *data_dev;
    int dataSize = pow(10, 7);
    cudaMallocHost(&data, dataSize*sizeof(int));
    cudaMalloc(&data_dev, dataSize*sizeof(int));
    
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(data_dev, data, dataSize*sizeof(int), cudaMemcpyHostToDevice, stream1);
    kernel<<<1, 32, 0, stream2>>>();
}

Visual Profiler show

After some attempts, I found out that this is due to it being the first kernel call.

Uncomment warmUp<<<1,32>>>();, Visual Profiler show, overlap!

Why?

Robert Crovella · Accepted Answer

CUDA uses lazy initialization. Because of this, the first time you do a particular operation or a particular operation type, it's possible that the behavior will not be as you expect.

The operation will/should work "correctly", but performance measurements may not be as you expect.

Contrary to the linked article, there really is no specified formula to force the lazy initialization to complete, without performing the actual work you intend to do.

If the only thing you ever intend to do with your application is launch a single kernel, then having that kernel overlap with a previous copy operation doesn't seem to make a lot of sense to me. In any event, you should expect that device initialization is necessary before all operations will proceed at expected speeds or in expected ways.

Lazy initialization behavior may vary based on CUDA version, platform (e.g. OS) and GPU type.

Additionally, kernel launches are asynchronous. So this particular coding pattern:

int main(){
    ...
    kernel<<<1, 32, 0, stream2>>>();
}

is generally not recommended in CUDA, and specifically is not recommended when using a profiler. Your code should provide the opportunity for all issued work to complete properly, in order for the profiler to provide useful results. You should provide a cudaDeviceSynchronize() or similar operation at the end of your code, if you want to profile it, for this type of pattern.

I also don't recommend doing performance analysis on kernels that are issuing printf calls. The printf call imposes additional host/device synchronization behavior/needs, and this can be confusing; its not easy to predict the performance impact of that.

why the first cuda kernel cannot overlap with previous memcpy?

Answers (1)

Related Questions