kangshiyin

Reputation: 9781

Why does the Nvidia Visual Profiler show overlapped data transfers in the timeline for purely synchronized code?

The timeline generated by the Nvidia Visual Profiler looks very strange. I don't write any code that overlaps transfers with computation, yet you can see overlap between memcpy and compute kernels in the timeline.

This makes it impossible for me to debug code that really does overlap transfers.

I use CUDA 5.0, a Tesla M2090, CentOS 6.3, and 2x Xeon E5-2609 CPUs.

Has anyone seen a similar problem? Does it occur only on certain Linux distributions? How can it be fixed?

This is the code.

#include <cuda.h>
#include <curand.h>
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/device_ptr.h>

int main()
{
    cublasHandle_t hd;
    curandGenerator_t rng;
    cublasCreate(&hd);
    curandCreateGenerator(&rng, CURAND_RNG_PSEUDO_MTGP32);

    const size_t m = 5000, n = 1000;
    const double alpha = 1.0;
    const double beta = 0.0;

    thrust::host_vector<double> h(n * m, 0.1);
    thrust::device_vector<double> a(m * n, 0.1);
    thrust::device_vector<double> b(n * m, 0.1);
    thrust::device_vector<double> c(m * m, 0.1);
    // Wait for the initial fills above to finish before entering the loop.
    cudaDeviceSynchronize();

    for (int i = 0; i < 10; i++)
    {
        // Fill a with uniform random doubles, then wait for completion.
        curandGenerateUniformDouble(rng,
                thrust::raw_pointer_cast(&a[0]), a.size());
        cudaDeviceSynchronize();

        // Host-to-device copy into b, then wait for completion.
        thrust::copy(h.begin(), h.end(), b.begin());
        cudaDeviceSynchronize();

        // c = a * b, then wait for completion.
        cublasDgemm(hd, CUBLAS_OP_N, CUBLAS_OP_N,
                m, m, n, &alpha,
                thrust::raw_pointer_cast(&a[0]), m,
                thrust::raw_pointer_cast(&b[0]), n,
                &beta,
                thrust::raw_pointer_cast(&c[0]), m);
        cudaDeviceSynchronize();
    }

    curandDestroyGenerator(rng);
    cublasDestroy(hd);

    return 0;
}
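As a host-side cross-check that the work really is serialized (independent of whatever the profiler draws), one could bracket each phase with CUDA events and record elapsed times. This is only a sketch, not code from the question: because the host synchronizes after every phase, the measured intervals cannot overlap on the device, so any overlap in the profiler timeline must be a display artifact.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: measure the device time spent in one phase of work
// that the callback enqueues on the default stream. The synchronize at the
// end means phases timed with this helper are strictly serialized.
template <typename F>
float timePhase(F enqueueWork)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);   // timestamp on the default stream
    enqueueWork();               // enqueue this phase's kernels/copies
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // host blocks until the phase finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Usage sketch: wrap each of the three phases in the loop, e.g.
//   float genMs = timePhase([&]{ curandGenerateUniformDouble(rng, ...); });
//   printf("curand phase: %.3f ms\n", genMs);
```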

This is the captured profiler timeline.

[profiler timeline screenshot]

Upvotes: 0

Views: 439

Answers (1)

Greg Smith

Reputation: 11529

Compute Capability 2.* (Fermi) devices are capable of both kernel-level concurrency and kernel/copy concurrency. In order to trace concurrent kernels, the kernel start and end timestamps are collected in a different clock domain from the memory-copy timestamps. The tool is responsible for correlating these different clocks. In your screenshot I believe there is a scaling-factor difference (bad clock correlation): as you can see, each memory copy is not off by a constant value but by a scaled offset.

If you use the option --concurrent-kernels off in nvprof, I think the problem will disappear. When concurrent kernels are disabled, memory-copy and kernel timing use the same source clock for timestamps.
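For example, a profiling run with concurrent-kernel tracing disabled might look like this (./a.out stands in for whatever your compiled binary is called):

```shell
# Disable concurrent-kernel tracing so copies and kernels are
# timestamped from the same source clock.
nvprof --concurrent-kernels off ./a.out
```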

Compute Capability 3.* (Kepler) and 5.* (Maxwell) devices have a different mechanism for timing compute kernels. For these devices it is possible for the tools to show the end timestamp of a kernel overlapping the start of the next memory copy or kernel. The work does not actually overlap. There was a design decision in the tools between allowing this potential for apparent overlap (usually <500 ns) or introducing it as a constant overhead between dependent work. The tools chose to avoid introducing the overhead, at the cost of potentially showing a very small amount of overlap on serialized work.

Upvotes: 1
