Reputation: 49
I have a rather strange observation about the following code snippet.
When I do both, i.e. copy the parameters to the device and copy the results back to the host, the streams seem to be synchronized: they execute the kernels sequentially. If I remove the copy back to the host and keep only the copy of the parameters to the device, the streams execute in parallel. If I instead remove the copy of the parameters and keep only the copy of the results, the streams also execute in parallel.
Any idea why, and how to solve the problem?
    for (int j = 0; j < n_streams; j++) {
        cuMemcpyHtoDAsync(gpu_parameters[j], parameters[j].asPointer(), (parameterCount) * Sizeof.FLOAT, stream[j]);
        Pointer kernelParameters1 = Pointer.to(
            Pointer.to(new int[]{0}),
            Pointer.to(new int[]{10000}),
            Pointer.to(gpu_data),
            Pointer.to(gpu_results[j]),
            Pointer.to(gpu_parameters[j])
        );
        cuLaunchKernel(function[j],
            s_grid, 1, 1,            // Grid dimension
            s_block, 1, 1,           // Block dimension
            0, stream[j],            // Shared memory size and stream
            kernelParameters1, null  // Kernel- and extra parameters
        );
        cuMemcpyDtoHAsync(results[j].asPointer(), gpu_results[j], (results[j].size()) * Sizeof.FLOAT, stream[j]);
    }
Upvotes: 1
Views: 504
Reputation: 49
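I have no idea why, but changing the order in which the operations are issued removed the problem. With all host-to-device copies issued first, then all kernel launches, then all device-to-host copies, the streams execute in parallel: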
    for (int j = 0; j < n_streams; j++) {
        cuMemcpyHtoDAsync(gpu_parameters[j], parameters[j].asPointer(), (parameterCount) * Sizeof.FLOAT, stream[j]);
    }
    for (int j = 0; j < n_streams; j++) {
        Pointer kernelParameters1 = Pointer.to(
            Pointer.to(new int[]{0}),
            Pointer.to(new int[]{getNPrices()}),
            Pointer.to(get_gpu_prices()),
            Pointer.to(gpu_results[j]),
            Pointer.to(gpu_parameters[j])
            //,Pointer.to(new int[]{0})
        );
        cuLaunchKernel(function[j],
            s_grid, 1, 1,            // Grid dimension
            s_block, 1, 1,           // Block dimension
            0, stream[j],            // Shared memory size and stream
            kernelParameters1, null  // Kernel- and extra parameters
        );
    }
    for (int j = 0; j < n_streams; j++) {
        cuMemcpyDtoHAsync(results[j].asPointer(), gpu_results[j], (results[j].size()) * Sizeof.FLOAT, stream[j]);
    }
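One plausible explanation, which is an assumption and not something confirmed for this particular GPU, is that the device has a single copy engine serving both transfer directions in issue order. In the original loop the device-to-host copy of stream j is queued ahead of the host-to-device copy of stream j+1, and it cannot start until kernel j has finished, so the next kernel's input copy (and therefore the next kernel) waits as well. The sketch below is a toy timing model in plain Java, not real CUDA: the class IssueOrderModel, its methods, and the durations are all hypothetical and only illustrate how the issue order could produce the behavior described in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Toy timing model, NOT real CUDA. Assumptions: one in-order copy engine shared
// by both transfer directions, kernels may run concurrently, streams are in-order.
class IssueOrderModel {
    enum Kind { H2D, KERNEL, D2H }

    static final class Op {
        final int stream;
        final Kind kind;
        Op(int stream, Kind kind) { this.stream = stream; this.kind = kind; }
    }

    // Hypothetical durations in arbitrary time units.
    static final int COPY_TIME = 1;
    static final int KERNEL_TIME = 10;

    // Issue order of the question: everything for stream 0, then stream 1, ...
    static List<Op> depthFirst(int nStreams, boolean withH2D, boolean withD2H) {
        List<Op> ops = new ArrayList<>();
        for (int j = 0; j < nStreams; j++) {
            if (withH2D) ops.add(new Op(j, Kind.H2D));
            ops.add(new Op(j, Kind.KERNEL));
            if (withD2H) ops.add(new Op(j, Kind.D2H));
        }
        return ops;
    }

    // Issue order of the answer: all uploads, then all kernels, then all downloads.
    static List<Op> breadthFirst(int nStreams) {
        List<Op> ops = new ArrayList<>();
        for (int j = 0; j < nStreams; j++) ops.add(new Op(j, Kind.H2D));
        for (int j = 0; j < nStreams; j++) ops.add(new Op(j, Kind.KERNEL));
        for (int j = 0; j < nStreams; j++) ops.add(new Op(j, Kind.D2H));
        return ops;
    }

    // Returns true if at least two kernels are in flight at the same time.
    static boolean kernelsOverlap(List<Op> ops, int nStreams) {
        int[] streamReady = new int[nStreams]; // time at which each stream's last op ends
        int copyEngineFree = 0;                // time at which the copy engine is idle again
        int lastKernelStart = 0;               // kernels are dispatched in issue order
        int latestKernelEnd = 0;
        boolean overlap = false;
        for (Op op : ops) {
            if (op.kind == Kind.KERNEL) {
                int start = Math.max(streamReady[op.stream], lastKernelStart);
                if (start < latestKernelEnd) overlap = true;
                lastKernelStart = start;
                latestKernelEnd = Math.max(latestKernelEnd, start + KERNEL_TIME);
                streamReady[op.stream] = start + KERNEL_TIME;
            } else {
                // A copy waits for its own stream AND for the copy ahead of it in the queue.
                int start = Math.max(streamReady[op.stream], copyEngineFree);
                copyEngineFree = start + COPY_TIME;
                streamReady[op.stream] = copyEngineFree;
            }
        }
        return overlap;
    }

    public static void main(String[] args) {
        int n = 2;
        System.out.println("depth-first, both copies: " + kernelsOverlap(depthFirst(n, true, true), n));
        System.out.println("depth-first, upload only: " + kernelsOverlap(depthFirst(n, true, false), n));
        System.out.println("depth-first, download only: " + kernelsOverlap(depthFirst(n, false, true), n));
        System.out.println("breadth-first, both copies: " + kernelsOverlap(breadthFirst(n), n));
    }
}
```

In this model only the depth-first order with both copies reports no kernel overlap; the two reduced variants and the breadth-first order all overlap, which matches the observations above. Whether the real device behaves this way depends on its number of copy engines and hardware queues, so a profiler timeline is the way to confirm it.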
Upvotes: 1