Reputation: 9781
The timeline generated by Nsight Visual Profile looks very strange. I don't write any transfer overlapping code, but you can see overlap between MemCpy
and Compute
kernels.
This makes me unable to debug the real overlapping code.
I use CUDA 5.0, Tesla M2090, Centos 6.3, 2x CPU Xeon E5-2609
Anyone has the similar problem? Does it occur only on certain linux distributions? How to fix it?
This is the code.
#include <cuda.h>
#include <curand.h>
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/device_ptr.h>
int main()
{
cublasHandle_t hd;
curandGenerator_t rng;
cublasCreate(&hd);
curandCreateGenerator(&rng, CURAND_RNG_PSEUDO_MTGP32);
const size_t m = 5000, n = 1000;
const double alpha = 1.0;
const double beta = 0.0;
thrust::host_vector<double> h(n * m, 0.1);
thrust::device_vector<double> a(m * n, 0.1);
thrust::device_vector<double> b(n * m, 0.1);
thrust::device_vector<double> c(m * m, 0.1);
cudaDeviceSynchronize();
for (int i = 0; i < 10; i++)
{
curandGenerateUniformDouble(rng,
thrust::raw_pointer_cast(&a[0]), a.size());
cudaDeviceSynchronize();
thrust::copy(h.begin(), h.end(), b.begin());
cudaDeviceSynchronize();
cublasDgemm(hd, CUBLAS_OP_N, CUBLAS_OP_N,
m, m, n, &alpha,
thrust::raw_pointer_cast(&a[0]), m,
thrust::raw_pointer_cast(&b[0]), n,
&beta,
thrust::raw_pointer_cast(&c[0]), m);
cudaDeviceSynchronize();
}
curandDestroyGenerator(rng);
cublasDestroy(hd);
return 0;
}
This is profile timeline captured.
Upvotes: 0
Views: 439
Reputation: 11529
Compute Capability 2.* (Fermi) devices are capable of both kernel level concurrency and kernel and copy concurrency. In order to trace concurrent kernels the kernel start and end timestamps are collected in a separate clock domain than the memory copy timestamps. The tool is responsible for correlating these different clocks. In your screenshot I believe there is a scaling factoring different (bad correlation) as you can see each memory copy is not off by a constant value but is off by a scaled offset.
If you use the option --concurrent-kernels off
in nvprof I think the problem will disappear. When concurrent kernels are disabled the memory copy and kernel timing use the same source clock for timestamps.
Compute Capability 3.* (Kepler) and 5.* (Maxwell) have a different mechanism for timing compute kernels. For these devices it is possible in tools to see overlap with the end timestamp of a kernel and the start of a memory copy or kernel. The work does not overlap. There was a design decision in the tools between having potential for overlap (usually <500ns) or introduction of this as a constant overhead between dependent work. The tools decided to avoid introduction of the overhead at the cost of potentially showing very small level of overlap on serialized work.
Upvotes: 1