qiuhan1989

Reputation: 1663

Using a loop to measure the execution time of cuFFT, the relationship between time and loop count is not linear

Basically I want to measure the time taken by the cuFFT function by putting the cuFFT execution call in a for loop. Here is the code I used first (it is based on the simple example on the Nvidia CUDA website):

By the way, my CPU is an Intel i7-3630QM at 2.40GHz and my GPU is an Nvidia NVS 5200M. The platform is Visual Studio 2012 with CUDA 5.5, and the operating system is Windows 7, 64-bit.

#include "cuda_runtime.h"
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <cufft.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

#define NX 1024
#define NY 1024

int main(int argc, char** argv) {

    int i;
    int Iter;
    cufftHandle plan;    // A plan containing all information needed for the Fourier transform.
    cufftComplex *data1; // Device buffer holding the complex input and output of the transform.

    cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY); // Allocate input/output for the NX*NY 2D transform in GPU memory.

    cufftPlan2d(&plan, NX, NY, CUFFT_C2C); // Prepare a 2D NX*NY transform of type C2C (complex to complex).

    Iter = 1000;

    clock_t begin, end;
    double cost;
    begin = clock();

    for (i = 0; i < Iter; i++) {
        cufftExecC2C(plan, data1, data1, CUFFT_FORWARD); // In-place transform: input and output both in data1; CUFFT_FORWARD means FFT, not inverse FFT.
    }
    end = clock();
    cost = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%lf seconds\n", cost);

    cufftDestroy(plan);
    cudaFree(data1);

    return 0;
}

This program normally reports about 0.030s. If I change Iter (the loop count) to 1100, the result becomes 0.033s, and with Iter = 1200 it is 0.036s, which seems linear.

This stays linear until Iter reaches 1500: with Iter = 1500 the time is 0.195s, and with Iter = 1600 it is 0.431s.

I don't understand why the time cost behaves like this. Can anyone help me?

Thank you in advance.

Upvotes: 2

Views: 870

Answers (1)

Robert Crovella

Reputation: 152173

Modify your code as follows:

cudaDeviceSynchronize();  // add this line
end = clock();

And I believe you'll get sane results.

The CUFFT functions are asynchronous so that they can support streamed overlap of copy and compute. That means they return before the underlying GPU operation is complete. So your for-loop is in effect queueing up a large number of transforms to be performed one after the other, but they are not necessarily finished by the time you stop your timer. The complex behavior you are observing is, I believe, related to exceeding the depth of an internal queue that limits how many kernel launches can be outstanding; once it fills, additional launch requests must wait for queue slots to open up. But that isn't the central issue.

The central issue is that your timing method is flawed. This is just another example of the dangers inherent in using host-based timing methods to time asynchronous GPU activity.
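For timing GPU work, CUDA events are generally more reliable than host-side `clock()`, because they record timestamps on the device's own timeline. Below is a minimal sketch of the question's benchmark rewritten with events; it reuses the question's `NX`, `NY`, and `Iter` values, and error checking is omitted for brevity:

```cuda
#include <cstdio>
#include <cufft.h>
#include <cuda_runtime.h>

#define NX 1024
#define NY 1024

int main() {
    cufftHandle plan;
    cufftComplex *data1;
    cudaMalloc((void**)&data1, sizeof(cufftComplex) * NX * NY);
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);

    // Events record timestamps in the GPU's stream, so they measure
    // the device work itself rather than just the launch overhead.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int Iter = 1000;
    cudaEventRecord(start);
    for (int i = 0; i < Iter; i++) {
        cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop); // block the host until the stop event has actually occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed time between the events, in milliseconds
    printf("%f seconds\n", ms / 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(data1);
    return 0;
}
```

The `cudaEventSynchronize(stop)` call plays the same role as the `cudaDeviceSynchronize()` suggested above: it forces the host to wait until all the queued transforms have actually finished before the elapsed time is read.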

Upvotes: 5
