eLe

Reputation: 49

Cannot use cuMemcpyHtoDAsync and cuMemcpyDtoHAsync at the same time

I have a rather strange observation on the following code snippet.

When I do both, copying memory to the device and copying the results back to the host, the streams seem to be synchronized, i.e. the kernels execute sequentially. If I remove the copy to the host and keep copying the parameters to the device, the streams execute in parallel. If I remove the copy of the parameters and keep copying the results, the streams also execute in parallel.

Any idea why, and how to solve the problem?

for (int j = 0; j < n_streams; j++) {
    // Upload the parameters for stream j
    cuMemcpyHtoDAsync(gpu_parameters[j], parameters[j].asPointer(), (parameterCount) * Sizeof.FLOAT, stream[j]);
    Pointer kernelParameters1 = Pointer.to(
            Pointer.to(new int[]{0}),
            Pointer.to(new int[]{10000}),
            Pointer.to(gpu_data),
            Pointer.to(gpu_results[j]),
            Pointer.to(gpu_parameters[j])
            );
    // Launch the kernel in the same stream
    cuLaunchKernel(function[j],
            s_grid, 1, 1,           // Grid dimension
            s_block, 1, 1,          // Block dimension
            0, stream[j],           // Shared memory size and stream
            kernelParameters1, null // Kernel parameters and extra options
            );
    // Download the results for stream j
    cuMemcpyDtoHAsync(results[j].asPointer(), gpu_results[j], (results[j].size()) * Sizeof.FLOAT, stream[j]);
}

Upvotes: 1

Views: 504

Answers (1)

eLe

Reputation: 49

I have no idea why, but changing the order of the calls removed the problem, and the streams now execute in parallel. Each step is issued for all streams before moving on to the next step:

// 1. Issue all host-to-device copies
for (int j = 0; j < n_streams; j++) {
    cuMemcpyHtoDAsync(gpu_parameters[j], parameters[j].asPointer(), (parameterCount) * Sizeof.FLOAT, stream[j]);
}
// 2. Issue all kernel launches
for (int j = 0; j < n_streams; j++) {
    Pointer kernelParameters1 = Pointer.to(
            Pointer.to(new int[]{0}),
            Pointer.to(new int[]{getNPrices()}),
            Pointer.to(get_gpu_prices()),
            Pointer.to(gpu_results[j]),
            Pointer.to(gpu_parameters[j])
            //,Pointer.to(new int[]{0})
            );
    cuLaunchKernel(function[j],
            s_grid, 1, 1,           // Grid dimension
            s_block, 1, 1,          // Block dimension
            0, stream[j],           // Shared memory size and stream
            kernelParameters1, null // Kernel parameters and extra options
            );
}
// 3. Issue all device-to-host copies
for (int j = 0; j < n_streams; j++) {
    cuMemcpyDtoHAsync(results[j].asPointer(), gpu_results[j], (results[j].size()) * Sizeof.FLOAT, stream[j]);
}
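
One possible explanation, which nothing in this thread confirms for the hardware in question, is that the device has a single copy engine that processes transfers in both directions strictly in the order they were issued. In that case the copy back to the host for stream j sits in front of the upload for stream j+1 and has to wait for kernel j, so everything serializes. The sketch below is a toy model of that assumption in plain Java, not JCuda code. The class IssueOrderModel and the durations COPY_TIME and KERNEL_TIME are hypothetical and only serve to compare the two issue orders.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not a description of any specific GPU: one copy engine that
// serves both transfer directions strictly in issue order, plus kernels that
// may run concurrently once their stream's previous operation has finished.
class IssueOrderModel {
    enum Kind { H2D, KERNEL, D2H }

    // One queued operation: which stream it belongs to and what it does.
    static final class Op {
        final int stream;
        final Kind kind;
        Op(int stream, Kind kind) { this.stream = stream; this.kind = kind; }
    }

    // Hypothetical durations in arbitrary time units.
    static final int COPY_TIME = 1;
    static final int KERNEL_TIME = 10;

    // Order used in the question: copy, launch, copy back, stream by stream.
    static List<Op> depthFirst(int nStreams) {
        List<Op> ops = new ArrayList<>();
        for (int j = 0; j < nStreams; j++) {
            ops.add(new Op(j, Kind.H2D));
            ops.add(new Op(j, Kind.KERNEL));
            ops.add(new Op(j, Kind.D2H));
        }
        return ops;
    }

    // Order used in the answer: all uploads, then all launches, then all downloads.
    static List<Op> breadthFirst(int nStreams) {
        List<Op> ops = new ArrayList<>();
        for (Kind kind : Kind.values()) {
            for (int j = 0; j < nStreams; j++) {
                ops.add(new Op(j, kind));
            }
        }
        return ops;
    }

    // Returns the time at which the last operation finishes.
    static int makespan(List<Op> ops, int nStreams) {
        int[] streamReady = new int[nStreams]; // finish time of each stream's latest op
        int copyEngineFree = 0;                // finish time of the latest copy
        int end = 0;
        for (Op op : ops) {
            // Every operation waits for the previous operation in its own stream.
            int start = streamReady[op.stream];
            int duration = KERNEL_TIME;
            if (op.kind != Kind.KERNEL) {
                // A copy also waits for every copy issued before it.
                start = Math.max(start, copyEngineFree);
                duration = COPY_TIME;
            }
            int finish = start + duration;
            if (op.kind != Kind.KERNEL) {
                copyEngineFree = finish;
            }
            streamReady[op.stream] = finish;
            end = Math.max(end, finish);
        }
        return end;
    }

    public static void main(String[] args) {
        int n = 3;
        System.out.println("depth-first:   " + makespan(depthFirst(n), n));
        System.out.println("breadth-first: " + makespan(breadthFirst(n), n));
    }
}
```

With three streams this model finishes at time 36 for the order from the question and at time 14 for the reordered version. Dropping either copy from the stream-by-stream order also removes the blocking in the model, which matches the observation in the question, but real devices may schedule work differently.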

Upvotes: 1
