Reputation: 305
I believe the following code carries out a fairly typical CUDA workflow: copy data to the device, launch a kernel, then copy the results back to the host.
What I found very strange is that when I used the Trace Application option in the NSight profiler, with 'stack trace' turned on, the report showed the most expensive operation to be the line in bold, and just that line, while the other cudaMemcpy operations cost only about 10% or less of that one.
Is this because it is the first line after the kernel call, so the profiler somehow folds the cost of some synchronization into the cost of this particular cudaMemcpy?
For a problem like the one I'm working on, which requires very frequent synchronization and 'returning' the result to the host, can anyone offer some general advice on best practice? I was thinking in particular about two options, though I am not sure whether either will actually help.
{
int numP = p_psPtr->P.size();
int numL = p_psPtr->L.size();
// Our partition is in units of the number of particles
int block_dim = BLOCK_DIM_X;
int grid_dim = numP/block_dim + (numP%block_dim == 0 ? 0:1);
vector<Particle> pVec(p_psPtr->P.begin(), p_psPtr->P.end());
Particle *d_part_arr = 0;
Particle *part_arr = pVec.data();
HANDLE_ERROR(cudaMalloc((void**)&d_part_arr, numP * sizeof(Particle)));
HANDLE_ERROR(cudaMemcpy(d_part_arr, part_arr, numP * sizeof(Particle), cudaMemcpyHostToDevice));
vector<SpringLink> lVec(p_psPtr->L.begin(), p_psPtr->L.end());
SpringLink *d_link_arr = 0;
SpringLink *link_arr = lVec.data();
HANDLE_ERROR(cudaMalloc((void**)&d_link_arr, numL * sizeof(SpringLink)));
HANDLE_ERROR(cudaMemcpy(d_link_arr, link_arr, numL * sizeof(SpringLink), cudaMemcpyHostToDevice));
Point3D *d_oriPos_arr = 0;
Point3D *oriPos_arr = p_originalPos.data();
HANDLE_ERROR(cudaMalloc((void**)&d_oriPos_arr, numP * sizeof(Point3D)));
HANDLE_ERROR(cudaMemcpy(d_oriPos_arr, oriPos_arr, numP * sizeof(Point3D), cudaMemcpyHostToDevice));
Vector3D *d_oriVel_arr = 0;
Vector3D *oriVel_arr = p_originalVel.data();
HANDLE_ERROR(cudaMalloc((void**)&d_oriVel_arr, numP * sizeof(Vector3D)));
HANDLE_ERROR(cudaMemcpy(d_oriVel_arr, oriVel_arr, numP * sizeof(Vector3D), cudaMemcpyHostToDevice));
Point3D *d_updPos_arr = 0;
Point3D *updPos_arr = p_updatedPos.data();
HANDLE_ERROR(cudaMalloc((void**)&d_updPos_arr, numP * sizeof(Point3D)));
HANDLE_ERROR(cudaMemcpy(d_updPos_arr, updPos_arr, numP * sizeof(Point3D), cudaMemcpyHostToDevice));
Vector3D *d_updVel_arr = 0;
Vector3D *updVel_arr = p_updatedVel.data();
HANDLE_ERROR(cudaMalloc((void**)&d_updVel_arr, numP * sizeof(Vector3D)));
HANDLE_ERROR(cudaMemcpy(d_updVel_arr, updVel_arr, numP * sizeof(Vector3D), cudaMemcpyHostToDevice));
int *d_converged_arr = 0;
int *converged_arr = &p_converged[0];
HANDLE_ERROR(cudaMalloc((void**)&d_converged_arr, numP * sizeof(int)));
HANDLE_ERROR(cudaMemcpy(d_converged_arr, converged_arr, numP * sizeof(int), cudaMemcpyHostToDevice));
// Run the function on the device
handleParticleKernel<<<grid_dim, block_dim>>>(d_part_arr, d_link_arr, numP,
d_oriPos_arr, d_oriVel_arr, d_updPos_arr, d_updVel_arr,
d_converged_arr, p_innerLoopIdx, p_dt);
**HANDLE_ERROR(cudaMemcpy(oriPos_arr, d_oriPos_arr, numP * sizeof(Point3D), cudaMemcpyDeviceToHost));**
HANDLE_ERROR(cudaMemcpy(oriVel_arr, d_oriVel_arr, numP * sizeof(Vector3D), cudaMemcpyDeviceToHost));
HANDLE_ERROR(cudaMemcpy(updPos_arr, d_updPos_arr, numP * sizeof(Point3D), cudaMemcpyDeviceToHost));
HANDLE_ERROR(cudaMemcpy(updVel_arr, d_updVel_arr, numP * sizeof(Vector3D), cudaMemcpyDeviceToHost));
HANDLE_ERROR(cudaMemcpy(converged_arr, d_converged_arr, numP * sizeof(int), cudaMemcpyDeviceToHost));
}
Upvotes: 0
Views: 669
Reputation: 4194
That particular `cudaMemcpy` call takes longer because it has to wait until your kernel completes. If you add a `cudaDeviceSynchronize` after the kernel, the perceived execution time of that `cudaMemcpy` call should be in line with all of the others. (Of course, the additional time you're seeing will instead be spent in your `cudaDeviceSynchronize` call.)
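For example, a minimal sketch using the names from your snippet, just to show where the wait would now be reported:

```cpp
handleParticleKernel<<<grid_dim, block_dim>>>(d_part_arr, d_link_arr, numP,
        d_oriPos_arr, d_oriVel_arr, d_updPos_arr, d_updVel_arr,
        d_converged_arr, p_innerLoopIdx, p_dt);

// Block the host until the kernel finishes; the wait that was previously
// attributed to the first cudaMemcpy now shows up on this call instead.
HANDLE_ERROR(cudaDeviceSynchronize());

// This copy now measures only the actual device-to-host transfer.
HANDLE_ERROR(cudaMemcpy(oriPos_arr, d_oriPos_arr, numP * sizeof(Point3D),
                        cudaMemcpyDeviceToHost));
```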
However, the time you spend in `cudaDeviceSynchronize` is a fairly fundamental cost that you can't really get around: if you need an output from your kernel, you have to wait until the kernel is done executing. Since kernel launches are asynchronous, you can execute unrelated statements while the kernel is running; in your case, though, the very next call copies one of the kernel's outputs to host memory, so you have to wait for the kernel to finish before you can get the data.
If your program permits, you can try breaking up your kernel launch and memory transfers into chunks and issuing them on different streams, although the viability of this depends on a couple of factors (e.g. your kernel might not decompose well into independent parts). If you do go this route, the best-case scenario looks like the overlapped transfer/execution example in the CUDA Best Practices Guide, sketched below.
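A rough sketch of what that chunking could look like for your loop. This assumes the per-particle work is independent (in particular, that spring links don't reach across chunk boundaries), that the host buffers are pinned (`cudaHostAlloc`/`cudaHostRegister`) so the async copies can actually overlap, and shows only one of your arrays; the chunk count is arbitrary:

```cpp
const int NCHUNK = 4;
cudaStream_t streams[NCHUNK];
for (int i = 0; i < NCHUNK; ++i)
    HANDLE_ERROR(cudaStreamCreate(&streams[i]));

int chunk = (numP + NCHUNK - 1) / NCHUNK;
for (int i = 0; i < NCHUNK; ++i) {
    int offset = i * chunk;
    int count  = (offset + chunk <= numP) ? chunk : (numP - offset);
    if (count <= 0) break;

    // Upload this chunk, run the kernel on it, and download the result,
    // all enqueued on the same stream so different chunks can overlap.
    HANDLE_ERROR(cudaMemcpyAsync(d_updPos_arr + offset, updPos_arr + offset,
                                 count * sizeof(Point3D),
                                 cudaMemcpyHostToDevice, streams[i]));
    // ... same pattern for the other per-particle arrays ...

    handleParticleKernel<<<(count + block_dim - 1) / block_dim, block_dim,
                           0, streams[i]>>>(
        d_part_arr + offset, d_link_arr, count,
        d_oriPos_arr + offset, d_oriVel_arr + offset,
        d_updPos_arr + offset, d_updVel_arr + offset,
        d_converged_arr + offset, p_innerLoopIdx, p_dt);

    HANDLE_ERROR(cudaMemcpyAsync(updPos_arr + offset, d_updPos_arr + offset,
                                 count * sizeof(Point3D),
                                 cudaMemcpyDeviceToHost, streams[i]));
}
HANDLE_ERROR(cudaDeviceSynchronize());
for (int i = 0; i < NCHUNK; ++i)
    HANDLE_ERROR(cudaStreamDestroy(streams[i]));
```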
This would allow you to overlap data transfers with kernel execution, which serves to hide some of the data transfer costs. You can achieve similar asynchrony with zero-copy, just be forewarned that such transfers aren't cached, so depending on your kernel access patterns, you can end up getting lower throughputs.
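If you want to experiment with zero-copy, a minimal sketch for one buffer follows; the `h_converged`/`d_converged` names are just illustrative, mirroring your converged array:

```cpp
// Must be called before any other CUDA calls create the context.
HANDLE_ERROR(cudaSetDeviceFlags(cudaDeviceMapHost));

// Pinned, mapped host allocation: the kernel reads/writes this memory
// directly over PCIe, so no explicit cudaMemcpy is needed for it.
int *h_converged = 0;
HANDLE_ERROR(cudaHostAlloc((void**)&h_converged, numP * sizeof(int),
                           cudaHostAllocMapped));

// Device-side alias of the same allocation; pass this to the kernel in
// place of the cudaMalloc'd array. After cudaDeviceSynchronize(), the
// results are already visible in h_converged on the host.
int *d_converged = 0;
HANDLE_ERROR(cudaHostGetDevicePointer((void**)&d_converged, h_converged, 0));
```

Remember that these accesses are not cached on the device, so this tends to pay off only when each element is touched once or twice per kernel launch.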
Upvotes: 2