Reputation: 149
I have a data structure with pointers (think linked lists). Its size can't be determined before launching the kernel that reads the input data. So I allocate data on the device during input processing.
However, trying to copy that data back to host fails. From what I could gather, this is because there is a limitation in CUDA that does not allow device-allocated memory to be accessed by the runtime API. That information, however, was for CUDA 4 with "a fix coming soon". Does anyone know if that fix or a workaround ever came? I can't seem to find any recent information on this.
Here's a reproducible example:
#include <cstdio>

__device__ int *devData;

__global__ void initKernel()
{
    devData = new int[6];
    devData[0] = 0;
    devData[1] = 1;
    devData[2] = 2;
    devData[3] = 3;
    devData[4] = 4;
    devData[5] = 5;
}

__global__ void printKernel()
{
    printf("Testing device: %d\n", devData[3]);
}

int main()
{
    initKernel<<<1,1>>>();
    cudaDeviceSynchronize();

    printKernel<<<1,1>>>();
    cudaDeviceSynchronize();

    int *devAddr;
    cudaGetSymbolAddress((void **)&devAddr, devData);

    int *hostData = new int[6];
    cudaMemcpy(hostData, devAddr, 6*sizeof(int), cudaMemcpyDeviceToHost); //cudaErrorInvalidValue (invalid argument)
    //same error with: cudaMemcpyFromSymbol(hostData, devData, 6*sizeof(int));

    printf("Testing host: %d\n", hostData[3]);
    return 0;
}
The cudaMemcpy fails with cudaErrorInvalidValue (and cudaMemcpyFromSymbol fails the same way). If I instead declare __device__ int devData[6]; rather than __device__ int *devData;, no error occurs and the program prints 3 as expected.
Upvotes: 3
Views: 778
Reputation: 151799
It's still not possible. This is documented in the programming guide:

In addition, device malloc() memory cannot be used in any runtime or driver API calls (i.e. cudaMemcpy, cudaMemset, etc.).

If you have data in allocations that were created by in-kernel malloc() that you wish to transfer to the host, you will need to transfer that data first to a device memory allocation (or managed allocation), before copying to host or using in host code.

The same comments and all aspects of usage for in-kernel malloc apply equally to in-kernel new as well as in-kernel cudaMalloc.
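As an illustration of that workaround, here is a minimal sketch (the copyKernel name and grid configuration are my own choices, not from your code): a kernel copies the in-kernel new allocation into a buffer created with cudaMalloc, which the runtime API is allowed to access.

```cuda
#include <cstdio>

__device__ int *devData;   // set by initKernel via in-kernel new

__global__ void initKernel()
{
    devData = new int[6];
    for (int i = 0; i < 6; ++i)
        devData[i] = i;
}

// Stage the device-heap data into a runtime-API-visible allocation.
__global__ void copyKernel(int *dst, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = devData[i];
}

int main()
{
    initKernel<<<1,1>>>();
    cudaDeviceSynchronize();

    int *devBuf;
    cudaMalloc(&devBuf, 6 * sizeof(int));   // ordinary device allocation

    copyKernel<<<1,1>>>(devBuf, 6);
    cudaDeviceSynchronize();

    int hostData[6];
    cudaMemcpy(hostData, devBuf, 6 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("Testing host: %d\n", hostData[3]);   // prints 3

    cudaFree(devBuf);
    return 0;
}
```

The same staging approach works with a managed allocation (cudaMallocManaged) in place of the cudaMalloc buffer.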
Upvotes: 4