kangshiyin

Reputation: 9781

Why does the Nvidia Visual Profiler show overlapped data transfers in the timeline for purely synchronized code?

The timeline generated by the Nvidia Visual Profiler looks very strange. I don't write any code that overlaps transfers with computation, yet you can see overlap between memcpy and compute kernels in the timeline.

This makes it impossible for me to debug code that really does overlap transfers.

I use CUDA 5.0, a Tesla M2090, CentOS 6.3, and 2x Xeon E5-2609 CPUs.

Has anyone seen a similar problem? Does it occur only on certain Linux distributions? How can it be fixed?

This is the code.

#include <cuda.h>
#include <curand.h>
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/device_ptr.h>

int main()
{
    cublasHandle_t hd;
    curandGenerator_t rng;
    cublasCreate(&hd);
    curandCreateGenerator(&rng, CURAND_RNG_PSEUDO_MTGP32);

    const size_t m = 5000, n = 1000;
    const double alpha = 1.0;
    const double beta = 0.0;

    thrust::host_vector<double> h(n * m, 0.1);
    thrust::device_vector<double> a(m * n, 0.1);
    thrust::device_vector<double> b(n * m, 0.1);
    thrust::device_vector<double> c(m * m, 0.1);
    // Wait for the initial fills above to finish before entering the loop.
    cudaDeviceSynchronize();

    for (int i = 0; i < 10; i++)
    {
        // Fill a with uniform random doubles, then wait for completion.
        curandGenerateUniformDouble(rng,
                thrust::raw_pointer_cast(&a[0]), a.size());
        cudaDeviceSynchronize();

        // Host-to-device copy into b, then wait for completion.
        thrust::copy(h.begin(), h.end(), b.begin());
        cudaDeviceSynchronize();

        // c = a * b, then wait for completion.
        cublasDgemm(hd, CUBLAS_OP_N, CUBLAS_OP_N,
                m, m, n, &alpha,
                thrust::raw_pointer_cast(&a[0]), m,
                thrust::raw_pointer_cast(&b[0]), n,
                &beta,
                thrust::raw_pointer_cast(&c[0]), m);
        cudaDeviceSynchronize();
    }

    curandDestroyGenerator(rng);
    cublasDestroy(hd);

    return 0;
}
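As a host-side cross-check that the work really is serialized (independent of whatever the profiler draws), one could bracket each phase with CUDA events and record elapsed times. This is only a sketch, not code from the question: because the host synchronizes after every phase, the measured intervals cannot overlap on the device, so any overlap in the profiler timeline must be a display artifact.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: measure the device time spent in one phase of work
// that the callback enqueues on the default stream. The synchronize at the
// end means phases timed with this helper are strictly serialized.
template <typename F>
float timePhase(F enqueueWork)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);   // timestamp on the default stream
    enqueueWork();               // enqueue this phase's kernels/copies
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // host blocks until the phase finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Usage sketch: wrap each of the three phases in the loop, e.g.
//   float genMs = timePhase([&]{ curandGenerateUniformDouble(rng, ...); });
//   printf("curand phase: %.3f ms\n", genMs);
```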

This is the captured profiler timeline.

[profiler timeline screenshot]

Upvotes: 0

Views: 439

Answers (1)

Greg Smith

Reputation: 11529

Compute Capability 2.* (Fermi) devices are capable of both kernel-level concurrency and kernel/copy concurrency. In order to trace concurrent kernels, the kernel start and end timestamps are collected in a different clock domain from the memory-copy timestamps. The tool is responsible for correlating these different clocks. In your screenshot I believe there is a scaling-factor difference (bad clock correlation): as you can see, each memory copy is not off by a constant value but by a scaled offset.

If you use the option --concurrent-kernels off in nvprof, I think the problem will disappear. When concurrent kernels are disabled, memory-copy and kernel timing use the same source clock for timestamps.
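For example, a profiling run with concurrent-kernel tracing disabled might look like this (./a.out stands in for whatever your compiled binary is called):

```shell
# Disable concurrent-kernel tracing so copies and kernels are
# timestamped from the same source clock.
nvprof --concurrent-kernels off ./a.out
```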

Compute Capability 3.* (Kepler) and 5.* (Maxwell) devices have a different mechanism for timing compute kernels. For these devices it is possible for the tools to show the end timestamp of a kernel overlapping the start of the next memory copy or kernel. The work does not actually overlap. There was a design decision in the tools between allowing this potential for apparent overlap (usually <500 ns) or introducing it as a constant overhead between dependent work. The tools chose to avoid introducing the overhead, at the cost of potentially showing a very small amount of overlap on serialized work.

Upvotes: 1
