Reputation: 51
I'm using CUDA 7.0 with an NVIDIA GTX 980 for some image processing. In a particular iteration, multiple tiles are processed independently via 15-20 kernel calls and multiple cuFFT FFT/IFFT API calls.
Because of this, I've placed each tile in its own CUDA stream so that each tile executes its string of operations asynchronously with respect to the host. The tiles are all the same size within an iteration, so they share a cuFFT plan. The host thread moves through the commands quickly in an attempt to keep the GPU loaded with work. I'm experiencing a periodic race condition while these operations are processed in parallel, though, and have a question about cuFFT in particular. If I place the cuFFT plan in stream 0 using cufftSetStream() for tile 0, and the FFT for tile 0 hasn't actually been executed on the GPU yet when the host sets the shared plan's stream to stream 1 for tile 1 and issues tile 1's work on the GPU, what is the behavior of cufftExec() for this plan?
More succinctly, does a call to cufftExec() execute in the stream the plan was set to at the time of the cufftExec() call, regardless of whether cufftSetStream() is used to change the stream for subsequent tiles before the previous FFT calls have actually begun or completed?
I apologize for not posting code, but I'm not able to post my actual source.
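To give an idea of the structure, though, the pattern is roughly the following (a minimal sketch only; the names and sizes are illustrative, not my actual code):
// rough sketch of the pattern described above (not the actual source):
// one shared cuFFT plan, one stream per tile, plan retargeted per tile
#include <cufft.h>

void process_tiles(cufftHandle plan, cufftComplex **d_tile,
                   cudaStream_t *stream, int ntiles)
{
  for (int t = 0; t < ntiles; t++) {
    cufftSetStream(plan, stream[t]);   // point the shared plan at this tile's stream
    // ... 15-20 per-tile kernels also launched into stream[t] ...
    cufftExecC2C(plan, d_tile[t], d_tile[t], CUFFT_FORWARD);
    // host moves on to the next tile without waiting for this work to finish
  }
}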
Upvotes: 0
Views: 771
Reputation: 152249
EDIT: As pointed out in the comments, if the same plan (same created handle) is used for simultaneous FFT execution on the same device via streams, then the user is responsible for managing separate work areas for each usage of such plan. The question seemed to have a focus on the stream behavior itself, and my remaining answer focuses on that as well, but this is an important point.
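One possible way that could look (a minimal sketch only; NSTREAMS, streams, and d_tile are assumed names, not taken from the test code below) is to disable the plan's automatic work-area allocation and attach a distinct work area before each use:
// minimal sketch (assumed names) of supplying a separate work area per
// usage of a shared plan, via cufftSetAutoAllocation()/cufftSetWorkArea()
cufftHandle plan;
cufftCreate(&plan);
cufftSetAutoAllocation(plan, 0);                 // we will supply the work area
size_t ws = 0;
cufftMakePlan1d(plan, nx, CUFFT_C2C, nb, &ws);   // ws = required work-area size

void *wa[NSTREAMS];
for (int i = 0; i < NSTREAMS; i++)
  cudaMalloc(&wa[i], ws);                        // one work area per stream

for (int i = 0; i < NSTREAMS; i++) {
  cufftSetStream(plan, streams[i]);              // stream for this usage
  cufftSetWorkArea(plan, wa[i]);                 // private work area for this usage
  cufftExecC2C(plan, d_tile[i], d_tile[i], CUFFT_FORWARD);
}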
If I place the cuFFT plan in stream 0 using cufftSetStream() for tile 0, and the FFT for tile 0 hasn't actually been executed on the GPU yet when the host sets the shared plan's stream to stream 1 for tile 1 and issues tile 1's work on the GPU, what is the behavior of cufftExec() for this plan?
Let me pretend you said stream 1 and stream 2, just so we can avoid any possible confusion around the NULL stream.
CUFFT should respect the stream that was defined for the plan at the time the plan was passed to CUFFT via cufftExecXXX(). Subsequent changes to the plan via cufftSetStream() should have no effect on the stream used for previously issued cufftExecXXX() calls.
We can verify this with a fairly simple test, using the profiler. Consider the following test code:
$ cat t1089.cu
// NOTE: this code omits independent work-area handling for each plan
// which is necessary for a plan that will be shared between streams
// and executed concurrently
#include <cufft.h>
#include <assert.h>
#include <nvToolsExt.h>
#define DSIZE 1048576
#define BATCH 100
int main(){
  const int nx = DSIZE;
  const int nb = BATCH;
  size_t ws = 0;
  cufftHandle plan;
  cufftResult res = cufftCreate(&plan);
  assert(res == CUFFT_SUCCESS);
  res = cufftMakePlan1d(plan, nx, CUFFT_C2C, nb, &ws);
  assert(res == CUFFT_SUCCESS);
  cufftComplex *d;
  cudaMalloc(&d, nx*nb*sizeof(cufftComplex));
  cudaMemset(d, 0, nx*nb*sizeof(cufftComplex));
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  res = cufftSetStream(plan, s1);                 // associate plan with s1
  assert(res == CUFFT_SUCCESS);
  res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);  // first FFT, issued while plan is set to s1
  assert(res == CUFFT_SUCCESS);
  res = cufftSetStream(plan, s2);                 // re-associate plan with s2
  assert(res == CUFFT_SUCCESS);
  nvtxMarkA("plan stream change");                // marker just after the stream change request
  res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);  // second FFT, issued while plan is set to s2
  assert(res == CUFFT_SUCCESS);
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -o t1089 t1089.cu -lcufft -lnvToolsExt
$ cuda-memcheck ./t1089
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
We're just doing two forward FFTs in a row, switching streams in between the two. We'll use an nvtx marker to clearly identify the point at which the plan stream association change request occurs. Now let's look at the nvprof --print-api-trace output (removing the lengthy start-up preamble):
983.84ms 617.00us cudaMalloc
984.46ms 21.628us cudaMemset
984.48ms 37.546us cudaStreamCreate
984.52ms 121.34us cudaStreamCreate
984.65ms 995ns cudaPeekAtLastError
984.67ms 996ns cudaConfigureCall
984.67ms 517ns cudaSetupArgument
984.67ms 21.908us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416])
984.69ms 349ns cudaGetLastError
984.69ms 203ns cudaPeekAtLastError
984.70ms 296ns cudaConfigureCall
984.70ms 216ns cudaSetupArgument
984.70ms 8.8920us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421])
984.71ms 272ns cudaGetLastError
984.71ms 177ns cudaPeekAtLastError
984.72ms 314ns cudaConfigureCall
984.72ms 229ns cudaSetupArgument
984.72ms 9.9230us cudaLaunch (void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426])
984.73ms 295ns cudaGetLastError
984.77ms - [Marker] plan stream change
984.77ms 434ns cudaPeekAtLastError
984.78ms 357ns cudaConfigureCall
984.78ms 228ns cudaSetupArgument
984.78ms 10.642us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431])
984.79ms 287ns cudaGetLastError
984.79ms 193ns cudaPeekAtLastError
984.80ms 293ns cudaConfigureCall
984.80ms 208ns cudaSetupArgument
984.80ms 7.7620us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436])
984.81ms 297ns cudaGetLastError
984.81ms 178ns cudaPeekAtLastError
984.81ms 269ns cudaConfigureCall
984.81ms 214ns cudaSetupArgument
984.81ms 7.4130us cudaLaunch (void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441])
984.82ms 312ns cudaGetLastError
984.82ms 152.63ms cudaDeviceSynchronize
$
We see that each FFT operation requires 3 kernel calls. In between, we see our nvtx marker indicating when the request for a plan stream change was made, and it's no surprise that this takes place after the first 3 kernel launches, but before the last 3. Finally, we note that essentially all of the execution time is absorbed in the final cudaDeviceSynchronize() call. All of the preceding calls are asynchronous and so execute more-or-less "immediately" in the first millisecond of execution. The final synchronize absorbs all the processing time of the 6 kernels, amounting to about 150 milliseconds.
So if the cufftSetStream() change were to have an effect on the first cufftExecC2C() call, we would expect to see some or all of the first 3 kernels launched into the same stream as that used for the last 3 kernels. But when we look at the nvprof --print-gpu-trace output:
$ nvprof --print-gpu-trace ./t1089
==3757== NVPROF is profiling process 3757, command: ./t1089
==3757== Profiling application: ./t1089
==3757== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
974.74ms 7.3440ms - - - - - 800.00MB 106.38GB/s Quadro 5000 (0) 1 7 [CUDA memset]
982.09ms 23.424ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 13 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416]
1.00551s 21.172ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 13 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421]
1.02669s 27.551ms (25600 1 1) (16 16 1) 61 17.000KB 0B - - Quadro 5000 (0) 1 13 void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426]
1.05422s 23.592ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 14 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431]
1.07781s 21.157ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 14 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436]
1.09897s 27.913ms (25600 1 1) (16 16 1) 61 17.000KB 0B - - Quadro 5000 (0) 1 14 void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
we see that in fact the first 3 kernels are issued into the first stream, and the last 3 kernels are issued into the second stream, just as requested. (And the sum total execution time of all kernels is approximately 150 ms, just as suggested by the api trace output.) Since the underlying kernel launches are asynchronous and are issued prior to the return of the cufftExecC2C() call, if you think about this carefully you'll come to the conclusion that it has to be this way: the stream to launch a kernel into is specified at kernel launch time. (And of course I think this is considered "preferred" behavior.)
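To connect this to ordinary CUDA usage: the stream is a parameter of the launch itself, so work that has already been enqueued is unaffected by anything the host does afterwards. A trivial sketch (illustrative only, not part of the test above):
// illustrative only: the stream is consumed at the point of launch, so a
// kernel already enqueued into s1 stays in s1 no matter what happens later
__global__ void work(float *p) { p[threadIdx.x] += 1.0f; }

void issue(float *d, cudaStream_t s)
{
  work<<<1, 256, 0, s>>>(d);   // stream fixed here, at launch time
}

// issue(d, s1);  // this launch stays in s1 ...
// issue(d, s2);  // ... even though the next launch targets s2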
Upvotes: 2