anti

Reputation: 3125

Returning data from a __global__ CUDA function?

I am starting out with CUDA and am trying a simple example where I send two arrays into a __global__ function, copy one to the other, and return the second one.

I have:

__global__
void add(int n, int *tri, int *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = tri[i];
}

and:

   //local copy of data
    int *tri2 = tri; // data checked, and is valid

    int *y = new int[width * height]; // same size as `tri`
    int N = width * height;

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&tri2, N * sizeof(int));
    cudaMallocManaged(&y, N * sizeof(int));

    // initialize y array on the host
    for (int i = 0; i < N; i++) {
        y[i] = 2;
    }

    // Run kernel on the GPU
    add<<<1, 256>>>(N, tri2, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    //copy back to host
    int i = 0;
    int f = -999.0; /* CPU copy of value */
    cudaMemcpy(&f, &y[i], sizeof(int), cudaMemcpyDeviceToHost);

    std::cout << "back: " << f << std::endl;
    std::cout << "orig: " << tri[i] << std::endl;

The orig value is 128, the same as when it went in. The returned f value is always 0. What am I missing?

Upvotes: 0

Views: 1737

Answers (1)

Abator Abetor

Reputation: 2598

The values of array tri are different from the values of array tri2.

With

cudaMallocManaged(&tri2, N * sizeof(int));

you allocate new managed memory and overwrite the pointer tri2, so it no longer points at your original data. The fresh allocation apparently happens to be zero-filled, so in the kernel you copy an array of zeros to y. The values of array tri are never copied.

Here is an example of how you could do it (untested):

int* tri = ....

int* tri_managed;
//allocate new managed memory and save pointer in tri_managed
cudaMallocManaged(&tri_managed, sizeof(int) * N);
//now copy entries of tri to tri_managed
for(int i = 0; i < N; i++)
    tri_managed[i] = tri[i];

int* y;
cudaMallocManaged(&y, N * sizeof(int));

// initialize y array
for (int i = 0; i < N; i++) {
    y[i] = 2;
}

//copy entries of tri_managed to y
add<<<1, 256>>>(N, tri_managed, y);

// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();

//copy back to host
int i = 0;
int f = -999; /* CPU copy of value */
//cudaMemcpy(&f, &y[i], sizeof(int), cudaMemcpyDeviceToHost);
//since managed memory is accessible on host and device, we can simply do this
f = y[i];

std::cout << "back: " << f << std::endl;
std::cout << "orig: " << tri[i] << std::endl;

//don't forget to free memory after usage
cudaFree(tri_managed);
cudaFree(y);

Upvotes: 1
