Julio César

Reputation: 13266

CUDAfy CopyFromDevice several orders of magnitude slower than CopyToDevice

I'm testing CUDAfy with a small gravity simulation and after running a profiler on the code I see that most of the time is spent on the CopyFromDevice method of the GPU. Here's the code:

    private void WithGPU(float dt)
    {
        this.myGpu.CopyToDevice(this.myBodies, this.myGpuBodies);
        this.myGpu.Launch(1024, 1, "MoveBodies", -1, dt, this.myGpuBodies);
        this.myGpu.CopyFromDevice(this.myGpuBodies, this.myBodies);
    }

Just to clarify, this.myBodies is an array with 10,000 structs like the following:

    [Cudafy(eCudafyType.Struct)]
    [StructLayout(LayoutKind.Sequential)]
    internal struct Body
    {
        public float Mass;

        public Vector Position;

        public Vector Speed;
    }

And Vector is a struct with two floats X and Y.

According to my profiler, the average timings for those three lines are 0.092, 0.192 and 222.873 ms. These timings were taken on a Windows 7 machine with an NVIDIA NVS 310.

Is there a way to improve the time of the CopyFromDevice() method?

Thank you

Upvotes: 1

Views: 459

Answers (1)

Robert Crovella

Reputation: 151934

CUDA kernel launches are asynchronous. This means that immediately after launching the kernel, the CPU thread is released to process the code immediately following the kernel launch, while the kernel is still executing.

If the subsequent code contains any sort of CUDA execution barrier, the CPU thread will stop at that barrier until kernel execution is complete. In CUDA, both cudaMemcpy (the operation underlying the CUDAfy CopyFromDevice method) and cudaDeviceSynchronize (the operation underlying the CUDAfy Synchronize method) contain execution barriers.

Therefore, from a host code perspective, such a barrier immediately following a kernel launch will appear to halt CPU thread execution for the duration of the kernel execution.

For this reason, the time the profiler attributes to CopyFromDevice in this example includes both the kernel execution time and the data copy time. To separate the two when profiling host code, call the Synchronize method immediately after the kernel launch, so that the wait for the kernel is charged to that call rather than to the copy.
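As a rough sketch of that suggestion, the method from the question could be instrumented as below. It reuses the question's fields (`myGpu`, `myBodies`, `myGpuBodies`) and launch arguments unchanged. The method name `WithGpuTimed`, the `Stopwatch` timing and the console output are illustrative additions, not part of the original code.

```csharp
// Assumes the containing class from the question, plus:
// using System;
// using System.Diagnostics;
private void WithGpuTimed(float dt)
{
    var sw = Stopwatch.StartNew();
    this.myGpu.CopyToDevice(this.myBodies, this.myGpuBodies);
    double copyToMs = sw.Elapsed.TotalMilliseconds;

    sw.Restart();
    // Launch returns almost immediately because the kernel runs asynchronously.
    this.myGpu.Launch(1024, 1, "MoveBodies", -1, dt, this.myGpuBodies);
    // Block here until the kernel has finished, so its run time is
    // measured in this step instead of inside CopyFromDevice.
    this.myGpu.Synchronize();
    double kernelMs = sw.Elapsed.TotalMilliseconds;

    sw.Restart();
    this.myGpu.CopyFromDevice(this.myGpuBodies, this.myBodies);
    double copyFromMs = sw.Elapsed.TotalMilliseconds;

    Console.WriteLine(
        "CopyToDevice: {0:F3} ms, Launch + Synchronize: {1:F3} ms, CopyFromDevice: {2:F3} ms",
        copyToMs, kernelMs, copyFromMs);
}
```

If the explanation above applies, most of the time previously reported for CopyFromDevice should move to the Launch + Synchronize step, which would point to the kernel itself, not the transfer, as the thing to optimize.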

Upvotes: 2
