mfeuling

Reputation: 51

CUDA cuFFT API behavior in concurrent streams

I'm using CUDA 7.0 with an NVIDIA GTX 980 for some image processing. In a particular iteration, multiple tiles are processed independently via 15-20 kernel calls and multiple cuFFT FFT/IFFT API calls.

Because of this, I've placed each tile within its own CUDA stream so that each tile executes its string of operations asynchronously with respect to the host. Each tile is the same size within an iteration, so they share a cuFFT plan. The host thread moves through the commands quickly in an attempt to keep the GPU loaded with work. However, I'm experiencing an intermittent race condition while these operations are being processed in parallel, and I have a question about cuFFT in particular. If I place a cuFFT plan in stream 0 using cufftSetStream() for tile 0, and the FFT for tile 0 hasn't actually executed on the GPU yet when the host switches the shared plan to stream 1 and issues tile 1's work, what is the behavior of cufftExec() for this plan?

More succinctly, does a call to cufftExec() execute in the stream the plan was set to at the time of the cufftExec() call, regardless of whether cufftSetStream() is used to change the plan's stream for subsequent tiles before the previous FFT calls have actually begun or completed?

I apologize for not posting code, but I'm not able to post my actual source.
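
To make the pattern concrete, though, it looks roughly like the following schematic (hypothetical names, not my real code):

// Schematic of the per-tile pattern described above (hypothetical names only).
#include <cufft.h>

__global__ void tileKernel(cufftComplex *data) { /* stand-in for one of the 15-20 per-tile kernels */ }

void processIteration(cufftHandle sharedPlan, cufftComplex **tileData,
                      cudaStream_t *streams, int numTiles, dim3 grid, dim3 block)
{
  for (int t = 0; t < numTiles; ++t) {
    // point the shared plan at this tile's stream, then queue the tile's work
    cufftSetStream(sharedPlan, streams[t]);
    tileKernel<<<grid, block, 0, streams[t]>>>(tileData[t]);
    cufftExecC2C(sharedPlan, tileData[t], tileData[t], CUFFT_FORWARD);
    // ... more kernels and FFT/IFFT calls per tile ...
    cufftExecC2C(sharedPlan, tileData[t], tileData[t], CUFFT_INVERSE);
    tileKernel<<<grid, block, 0, streams[t]>>>(tileData[t]);
  }
  // the host thread races ahead here while the GPU works through the streams
}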

Upvotes: 0

Views: 771

Answers (1)

Robert Crovella

Reputation: 152249

EDIT: As pointed out in the comments, if the same plan (same created handle) is used for simultaneous FFT execution on the same device via streams, then the user is responsible for managing a separate work area for each concurrent usage of that plan. The question seemed focused on the stream behavior itself, and the remainder of my answer focuses on that as well, but this is an important point.
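
As an illustration of that point, per-stream work areas could be managed on a single shared plan handle roughly as follows. This is a sketch only, using cufftSetAutoAllocation() and cufftSetWorkArea(); it assumes the work-area pointer, like the stream, is bound to the launched kernels at cufftExec time, and a separate plan per stream is the more conservative alternative if that assumption is in doubt:

// Sketch only: one shared plan, two streams, one work area per concurrent usage.
// Assumes the work-area pointer is captured at cufftExec time (like the stream).
#include <cufft.h>
#include <assert.h>

#define DSIZE 1048576
#define BATCH 10   // smaller batch than the test below, to keep two data copies plus work areas modest

int main(){
  const int nx = DSIZE;
  const int nb = BATCH;
  size_t ws = 0;
  cufftHandle plan;
  cufftResult res = cufftCreate(&plan);
  assert(res == CUFFT_SUCCESS);
  // take over work-area management before the plan is actually built
  res = cufftSetAutoAllocation(plan, 0);
  assert(res == CUFFT_SUCCESS);
  res = cufftMakePlan1d(plan, nx, CUFFT_C2C, nb, &ws);
  assert(res == CUFFT_SUCCESS);
  cufftComplex *d1, *d2;
  cudaMalloc(&d1, nx*nb*sizeof(cufftComplex));
  cudaMalloc(&d2, nx*nb*sizeof(cufftComplex));
  cudaMemset(d1, 0, nx*nb*sizeof(cufftComplex));
  cudaMemset(d2, 0, nx*nb*sizeof(cufftComplex));
  void *wa1, *wa2;                    // one work area per in-flight transform
  cudaMalloc(&wa1, ws);
  cudaMalloc(&wa2, ws);
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  // transform 1: work area 1, stream 1
  res = cufftSetWorkArea(plan, wa1);
  assert(res == CUFFT_SUCCESS);
  res = cufftSetStream(plan, s1);
  assert(res == CUFFT_SUCCESS);
  res = cufftExecC2C(plan, d1, d1, CUFFT_FORWARD);
  assert(res == CUFFT_SUCCESS);
  // transform 2: work area 2, stream 2 (issued while transform 1 may still be running)
  res = cufftSetWorkArea(plan, wa2);
  assert(res == CUFFT_SUCCESS);
  res = cufftSetStream(plan, s2);
  assert(res == CUFFT_SUCCESS);
  res = cufftExecC2C(plan, d2, d2, CUFFT_FORWARD);
  assert(res == CUFFT_SUCCESS);
  cudaDeviceSynchronize();
  return 0;
}

Each in-flight transform then scribbles in its own scratch space rather than in a single auto-allocated area shared by both.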

If I place a cuFFT plan in stream 0 using cufftSetStream() for tile 0, and the FFT for tile 0 hasn't actually executed on the GPU yet when the host switches the shared plan to stream 1 and issues tile 1's work, what is the behavior of cufftExec() for this plan?

Let me pretend you said stream 1 and stream 2, just so we can avoid any possible confusion around the NULL stream.

CUFFT should respect the stream that was defined for the plan at the time the plan was passed to CUFFT via cufftExecXXX(). Subsequent changes to the plan via cufftSetStream() should have no effect on the stream used for previously issued cufftExecXXX() calls.

We can verify this with a fairly simple test, using the profiler. Consider the following test code:

$ cat t1089.cu
// NOTE: this code omits independent work-area handling for each plan
// which is necessary for a plan that will be shared between streams
// and executed concurrently
#include <cufft.h>
#include <assert.h>
#include <nvToolsExt.h>

#define DSIZE 1048576
#define BATCH 100

int main(){

  const int nx = DSIZE;
  const int nb = BATCH;
  size_t ws = 0;
  cufftHandle plan;
  cufftResult res = cufftCreate(&plan);
  assert(res == CUFFT_SUCCESS);
  res = cufftMakePlan1d(plan, nx, CUFFT_C2C, nb, &ws);
  assert(res == CUFFT_SUCCESS);
  cufftComplex *d;
  cudaMalloc(&d, nx*nb*sizeof(cufftComplex));
  cudaMemset(d, 0, nx*nb*sizeof(cufftComplex));
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  res = cufftSetStream(plan, s1);                 // first FFT should run in s1
  assert(res == CUFFT_SUCCESS);
  res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);
  assert(res == CUFFT_SUCCESS);
  res = cufftSetStream(plan, s2);                 // switch the plan to s2 before the first FFT has finished
  assert(res == CUFFT_SUCCESS);
  nvtxMarkA("plan stream change");                // marks the switch point in the API trace
  res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);
  assert(res == CUFFT_SUCCESS);
  cudaDeviceSynchronize();
  return 0;
}


$ nvcc -o t1089 t1089.cu -lcufft -lnvToolsExt
$ cuda-memcheck ./t1089
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

We're just doing two forward FFTs in a row, switching streams in-between the two. We'll use an nvtx marker to clearly identify the point at which the plan stream association change request occurs. Now let's look at the nvprof --print-api-trace output (removing the lengthy start-up preamble):
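
$ nvprof --print-api-trace ./t1089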

983.84ms  617.00us  cudaMalloc
984.46ms  21.628us  cudaMemset
984.48ms  37.546us  cudaStreamCreate
984.52ms  121.34us  cudaStreamCreate
984.65ms     995ns  cudaPeekAtLastError
984.67ms     996ns  cudaConfigureCall
984.67ms     517ns  cudaSetupArgument
984.67ms  21.908us  cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416])
984.69ms     349ns  cudaGetLastError
984.69ms     203ns  cudaPeekAtLastError
984.70ms     296ns  cudaConfigureCall
984.70ms     216ns  cudaSetupArgument
984.70ms  8.8920us  cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421])
984.71ms     272ns  cudaGetLastError
984.71ms     177ns  cudaPeekAtLastError
984.72ms     314ns  cudaConfigureCall
984.72ms     229ns  cudaSetupArgument
984.72ms  9.9230us  cudaLaunch (void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426])
984.73ms     295ns  cudaGetLastError
984.77ms         -  [Marker] plan stream change
984.77ms     434ns  cudaPeekAtLastError
984.78ms     357ns  cudaConfigureCall
984.78ms     228ns  cudaSetupArgument
984.78ms  10.642us  cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431])
984.79ms     287ns  cudaGetLastError
984.79ms     193ns  cudaPeekAtLastError
984.80ms     293ns  cudaConfigureCall
984.80ms     208ns  cudaSetupArgument
984.80ms  7.7620us  cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436])
984.81ms     297ns  cudaGetLastError
984.81ms     178ns  cudaPeekAtLastError
984.81ms     269ns  cudaConfigureCall
984.81ms     214ns  cudaSetupArgument
984.81ms  7.4130us  cudaLaunch (void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441])
984.82ms     312ns  cudaGetLastError
984.82ms  152.63ms  cudaDeviceSynchronize
$

We see that each FFT operation requires 3 kernel calls. In between, we see our nvtx marker indicating when the request for a plan stream change was made, and it's no surprise that this takes place after the first 3 kernel launches, but before the last 3. Finally, we note that essentially all of the execution time is absorbed in the final cudaDeviceSynchronize() call. All of the preceding calls are asynchronous and so execute more-or-less "immediately" in the first millisecond of execution. The final synchronize absorbs all the processing time of the 6 kernels, amounting to about 150 milliseconds.
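
As a sanity check of that asynchrony, one could also wall-clock the host side directly. This is a hypothetical drop-in around the two cufftExecC2C() calls in t1089.cu, not part of the test as run; the enqueue cost should register in microseconds while the synchronize accounts for roughly the full 150 ms:

// Hypothetical addition around the exec calls in t1089.cu
// (requires #include <chrono> and <cstdio> at the top of the file).
auto t0 = std::chrono::high_resolution_clock::now();
res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);   // returns almost immediately
// ... cufftSetStream(plan, s2) and the second exec, as in the listing above ...
auto t1 = std::chrono::high_resolution_clock::now();
cudaDeviceSynchronize();                         // absorbs the ~150 ms of kernel time
auto t2 = std::chrono::high_resolution_clock::now();
printf("enqueue: %.3f ms   sync: %.3f ms\n",
       std::chrono::duration<double, std::milli>(t1 - t0).count(),
       std::chrono::duration<double, std::milli>(t2 - t1).count());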

So if the cufftSetStream() call were to have an effect on the first invocation of cufftExecC2C(), we would expect to see some or all of the first 3 kernels launched into the same stream as that used for the last 3 kernels. But when we look at the nvprof --print-gpu-trace output:

$ nvprof --print-gpu-trace ./t1089
==3757== NVPROF is profiling process 3757, command: ./t1089
==3757== Profiling application: ./t1089
==3757== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
974.74ms  7.3440ms                    -               -         -         -         -  800.00MB  106.38GB/s  Quadro 5000 (0)         1         7  [CUDA memset]
982.09ms  23.424ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        13  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416]
1.00551s  21.172ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        13  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421]
1.02669s  27.551ms          (25600 1 1)       (16 16 1)        61  17.000KB        0B         -           -  Quadro 5000 (0)         1        13  void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426]
1.05422s  23.592ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        14  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431]
1.07781s  21.157ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        14  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436]
1.09897s  27.913ms          (25600 1 1)       (16 16 1)        61  17.000KB        0B         -           -  Quadro 5000 (0)         1        14  void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$

we see that in fact the first 3 kernels are issued into the first stream (stream 13 in the trace) and the last 3 kernels are issued into the second stream (stream 14), just as requested. (And the sum total execution time of all kernels is approximately 150 ms, just as suggested by the API trace output.) Since the underlying kernel launches are asynchronous and are issued prior to the return of the cufftExecC2C() call, if you think about this carefully you'll come to the conclusion that it has to be this way. The stream to launch a kernel into is specified at kernel launch time. (And of course, I think this is the preferred behavior.)
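
To illustrate that last point with plain CUDA (my own analogy sketch, not anything cuFFT-specific): a kernel's stream is bound at launch time, so changing a "current stream" variable afterwards only affects launches made later, just as the plan's stream setting only affects subsequent cufftExec calls.

// Analogy sketch: a kernel's stream is fixed at launch time.
#include <cuda_runtime.h>

__global__ void dummy() { }      // stand-in for one of the FFT sub-kernels

int main(){
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaStream_t current = s1;     // analogous to the plan's stream association
  dummy<<<1, 1, 0, current>>>(); // enqueued into s1 ...
  current = s2;                  // ... changing "current" now ...
  dummy<<<1, 1, 0, current>>>(); // ... only affects this later launch (s2)
  cudaDeviceSynchronize();
  return 0;
}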

Upvotes: 2
