Reputation: 983
I am trying to understand CUDA streams and have written my first program that uses them, but it is slower than the usual version without streams.
Why is this code slower:
cudaMemcpyAsync(pole_dev, pole, size, cudaMemcpyHostToDevice, stream_1);
addKernel<<<count/100, 100, 0, stream_1>>>(pole_dev);
cudaMemcpyAsync(pole, pole_dev, size, cudaMemcpyDeviceToHost, stream_1);
cudaThreadSynchronize(); // I don't know difference between cudaThreadSync and cudaDeviceSync
cudaDeviceSynchronize(); // it acts relatively same...
than:
cudaMemcpy(pole_dev, pole, size, cudaMemcpyHostToDevice);
addKernel<<<count/100, 100>>>(pole_dev);
cudaMemcpy(pole, pole_dev, size, cudaMemcpyDeviceToHost);
I thought that it should run faster. The value of the variable count is 6 500 000 (maximum). The first version takes 14 milliseconds and the second takes 11 milliseconds.
Can anybody explain it to me, please?
Upvotes: 0
Views: 1811
Reputation: 5930
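In this snippet you are dealing with only a single stream (stream_1), but that is effectively what CUDA does for you automatically when you don't explicitly manipulate streams: everything is issued to one default stream, and operations within a single stream run in order, so nothing can overlap.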
To take advantage of streams and asynchronous memory transfers, you need to use several streams and split your data and computations across them, so that the copy for one chunk can overlap with the kernel for another.
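A minimal sketch of that idea, not a definitive implementation. It assumes the array holds int values and that addKernel adds 1 to each element (the question does not show the kernel body, so this version with an extra length parameter is a hypothetical stand-in). The stream count of 4, the block size of 256, and the helper name runChunked are also assumptions. The host buffer is allocated with cudaMallocHost because cudaMemcpyAsync can only overlap with other work when the host memory is pinned.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's addKernel: adds 1 to each element.
__global__ void addKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Splits the array into one chunk per stream and issues copy-in, kernel and
// copy-out for each chunk on its own stream.
// Returns the number of wrong elements, or -1 if an allocation failed.
int runChunked(int count, int nStreams)
{
    const int threads = 256;                              // assumed block size
    const int chunk = (count + nStreams - 1) / nStreams;  // elements per stream

    int *pole = nullptr, *pole_dev = nullptr;
    // Pinned host memory, needed for truly asynchronous transfers.
    if (cudaMallocHost((void **)&pole, count * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&pole_dev, count * sizeof(int)) != cudaSuccess) {
        cudaFreeHost(pole);
        return -1;
    }
    for (int i = 0; i < count; ++i) pole[i] = i;

    cudaStream_t *streams = new cudaStream_t[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        int n = (count - offset < chunk) ? (count - offset) : chunk;
        if (n <= 0) break;                                // no data left for this stream
        size_t bytes = n * sizeof(int);
        int blocks = (n + threads - 1) / threads;
        // Work in different streams may overlap; within one stream it stays ordered.
        cudaMemcpyAsync(pole_dev + offset, pole + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addKernel<<<blocks, threads, 0, streams[s]>>>(pole_dev + offset, n);
        cudaMemcpyAsync(pole + offset, pole_dev + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                              // wait for all streams

    int errors = 0;
    for (int i = 0; i < count; ++i)
        if (pole[i] != i + 1) ++errors;

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    delete[] streams;
    cudaFree(pole_dev);
    cudaFreeHost(pole);
    return errors;
}

int main()
{
    int errors = runChunked(6500000, 4);  // element count taken from the question
    printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

Whether this actually beats the plain cudaMemcpy version depends on the GPU (how many copy engines it has) and on how the kernel time compares with the transfer time, so it is worth timing both on your hardware.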
Upvotes: 2