Hannes

Reputation: 304

Reducing the time needed for cudaHostAlloc

I'm trying to find out whether it makes sense to copy data to pinned memory before transferring it to the device, since I have no control over how my input data is allocated (it comes from a library).

std::vector<int> idata(WORK_SIZE);
int *idata_aligned = NULL;
int *d1 = NULL;
int *d2 = NULL;

for (int i = 0; i < WORK_SIZE; ++i)
    idata[i] = i;
CUDA_CHECK_RETURN(cudaMalloc((void**) &d1, sizeof(int) * WORK_SIZE));
CUDA_CHECK_RETURN(cudaMalloc((void**) &d2, sizeof(int) * WORK_SIZE));

printf("unpinned:\n");
{
    boost::timer::auto_cpu_timer t;
    CUDA_CHECK_RETURN(cudaMemcpy(d1, &idata[0], sizeof(int) * WORK_SIZE, cudaMemcpyHostToDevice));
}

printf("copy to pinned:\n");
{
    boost::timer::auto_cpu_timer t;
    CUDA_CHECK_RETURN(cudaHostAlloc((void**) &idata_aligned, sizeof(int) * WORK_SIZE, cudaHostAllocDefault));
    memcpy(idata_aligned, &idata[0], sizeof(int) * WORK_SIZE);
    CUDA_CHECK_RETURN(cudaMemcpy(d2, idata_aligned, sizeof(int) * WORK_SIZE, cudaMemcpyHostToDevice));
}

Output for 10,000,000 elements:

unpinned:
 0.018919s wall, 0.020000s user + 0.000000s system = 0.020000s CPU (105.7%)
copy to pinned:
 0.045428s wall, 0.020000s user + 0.020000s system = 0.040000s CPU (88.1%)

The main cost seems to be the cudaHostAlloc call itself: even without the memcpy, the second approach is much slower.

Did I do something wrong? Is there another way to use pinned memory for already allocated memory?

Upvotes: 4

Views: 2684

Answers (1)

rico

Reputation: 184

The speed gain from pinned memory also depends on the size of the transfer and on your system. You could run the bandwidthTest sample from the CUDA samples first to see whether pinned memory actually pays off on your machine.

Otherwise I'd time the individual parts of your program (allocation, memcpy, PCIe transfer) to see where the time is lost.
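A minimal sketch of such a per-phase measurement, assuming the CUDA runtime API and std::chrono in place of boost::timer. The names CHECK, PhaseTimes, time_ms and time_pinned_path are made up for this illustration (CHECK stands in for the question's CUDA_CHECK_RETURN), and the input is assumed to be non-empty:

```cuda
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-check macro, standing in for CUDA_CHECK_RETURN.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s: %s\n", #call,                     \
                         cudaGetErrorString(err_));                     \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Wall-clock milliseconds for each phase of the "copy to pinned" path.
struct PhaseTimes {
    double alloc_ms;     // cudaHostAlloc
    double stage_ms;     // host memcpy into the pinned buffer
    double transfer_ms;  // cudaMemcpy host -> device
    double free_ms;      // cudaFreeHost
};

// Runs f() and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Times every phase separately. Assumes src is non-empty.
PhaseTimes time_pinned_path(const std::vector<int>& src) {
    const std::size_t bytes = src.size() * sizeof(int);
    int* d_buf = nullptr;
    int* h_pinned = nullptr;

    // Untimed: the first CUDA call also pays for context creation.
    CHECK(cudaMalloc((void**)&d_buf, bytes));

    PhaseTimes t{};
    t.alloc_ms = time_ms([&] {
        CHECK(cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault));
    });
    t.stage_ms = time_ms([&] { std::memcpy(h_pinned, src.data(), bytes); });
    t.transfer_ms = time_ms([&] {
        CHECK(cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaDeviceSynchronize());  // make sure the copy has finished
    });
    t.free_ms = time_ms([&] { CHECK(cudaFreeHost(h_pinned)); });

    CHECK(cudaFree(d_buf));
    return t;
}
```

Calling time_pinned_path(idata) from main and printing the four fields shows which phase dominates. If the pinned buffer can be allocated once and reused across many transfers, alloc_ms and free_ms become a one-time cost rather than a per-transfer cost.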

Depending on how much page-locked memory you allocate, your system may also need to swap other memory out to disk, which increases the runtime.
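On the question of pinning memory that is already allocated: the CUDA runtime also offers cudaHostRegister and cudaHostUnregister, which page-lock an existing host range in place instead of allocating a new buffer and copying into it. The sketch below is only an illustration; upload_registered is a made-up name, and it does not establish that registering is cheaper than cudaHostAlloc plus memcpy. Registration has its own cost, may be subject to platform or toolkit restrictions, and would have to be timed like the other phases.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Page-locks the caller's existing buffer in place, copies it to d_dst,
// then unpins it again. Returns the first CUDA error encountered.
cudaError_t upload_registered(const int* src, std::size_t count, int* d_dst) {
    const std::size_t bytes = count * sizeof(int);
    // cudaHostRegister takes a non-const pointer; the data is not modified.
    void* host = const_cast<int*>(src);

    cudaError_t err = cudaHostRegister(host, bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_dst, host, bytes, cudaMemcpyHostToDevice);

    // Always unpin, but report the copy error first if there was one.
    const cudaError_t unreg = cudaHostUnregister(host);
    return (err != cudaSuccess) ? err : unreg;
}
```

With the question's variables this would be called as upload_registered(&idata[0], WORK_SIZE, d2). If the same library buffer is transferred repeatedly, registering it once and unregistering it at the end would spread the pinning cost over all transfers.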

Anyway, the times you show are pretty short, so I guess the transfer sizes are small, too. You should also be able to gain speed by combining multiple small writes into one larger one.
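Combining small writes could look roughly like the following sketch, which packs several host chunks into one contiguous buffer and issues a single cudaMemcpy. The function name upload_batched and the offsets bookkeeping are assumptions made for this example; packing costs an extra host-side copy, so whether it wins depends on how many chunks there are and how small they are.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Packs all chunks back to back into one host buffer and uploads them with a
// single cudaMemcpy instead of one call per chunk. On success *d_out owns a
// device buffer (release it with cudaFree) and offsets[i] is the element
// index at which chunk i starts inside it.
cudaError_t upload_batched(const std::vector<std::vector<int>>& chunks,
                           int** d_out,
                           std::vector<std::size_t>& offsets) {
    *d_out = nullptr;
    offsets.clear();

    std::vector<int> packed;
    for (const auto& c : chunks) {
        offsets.push_back(packed.size());
        packed.insert(packed.end(), c.begin(), c.end());
    }
    if (packed.empty()) return cudaSuccess;  // nothing to transfer

    const std::size_t bytes = packed.size() * sizeof(int);
    cudaError_t err = cudaMalloc((void**)d_out, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(*d_out, packed.data(), bytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        cudaFree(*d_out);  // do not leak the buffer on a failed copy
        *d_out = nullptr;
    }
    return err;
}
```

A kernel can then address chunk i as d_out + offsets[i], so the per-call overhead of the transfer is paid once rather than once per chunk.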

Upvotes: 3
