Lorcan O'Brien

Reputation: 7

cudaMemcpy makes no sense to me... Why do I specify device memory in normal C++?

EDIT: I may have found a better way, using cudaMalloc(). I guess this was a bad question; I'll try cudaMalloc(), and if that turns out to be more logical, I'll close this.

So I'm able to write a bit of CUDA C/C++, but looking at the cudaMemcpy syntax, I see it copies to the specified device. But why do I specify two pieces of memory allocated on the host in the function call, as in this badly written example (I know it doesn't init the values...), where I tell it to copy h_array1/h_array2 to the respective device arrays? Why is it necessary to create the d_array in host memory, or is it?

Here is the code:

#include <cuda.h>
#include <iostream>

using std::cout;

unsigned long int arraysize = 20;

__global__ void CUDAAddArray(float* arrayfloat, float* arrayfloat2){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    arrayfloat[idx] += arrayfloat2[idx];
    //end cuda kernel __global__ void CUDAAddArray();
}

int main() {
    float* h_array1 = new float[arraysize];
    float* h_array2 = new float[arraysize];

    float* d_array1 = new float[arraysize];
    float* d_array2 = new float[arraysize];

    cudaMemcpy(d_array1, h_array1, sizeof(float)*arraysize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_array2, h_array2, sizeof(float)*arraysize, cudaMemcpyHostToDevice);

    CUDAAddArray<<<(arraysize%256)+1, 100>>> (d_array1, d_array2);

    cudaMemcpy(h_array1, d_array1, sizeof(float)*arraysize, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_array2, d_array2, sizeof(float)*arraysize, cudaMemcpyDeviceToHost);

    for(int i = 0; i < arraysize; i++){
        cout << h_array1[i];
        cout << "\n";
    }

    cout << std::endl;
    return NULL;
}

Thanks, a CUDA newbie.

Upvotes: 0

Views: 976

Answers (1)

Roger Dahl

Reputation: 15724

Since you don't initialize the memory, you can't know whether the program actually works. And in fact it doesn't work, but the failure is hidden because you don't check whether your CUDA API calls and kernel launch succeed.

As you have guessed, your d_ arrays are supposed to be allocated in device memory with cudaMalloc(). Then things would make sense: the cudaMemcpy() calls would copy the buffers from host to device memory for processing and then copy the result back. Of course, d_array2 does not have to be copied back, because the kernel doesn't modify it.
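To make that concrete, here is a minimal sketch of the corrected flow (not the asker's code verbatim; names like addArrays are mine, and the error handling is deliberately bare-bones). It allocates the device buffers with cudaMalloc(), initializes the host data, checks the launch for errors, and adds a bounds check in the kernel so the grid can safely overshoot the array length:

```cuda
#include <cuda_runtime.h>
#include <iostream>

// Pass the length so threads beyond the end of the array can bail out.
__global__ void addArrays(float* a, const float* b, unsigned long n) {
    unsigned long idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] += b[idx];
}

int main() {
    const unsigned long n = 20;

    // Host buffers, initialized this time.
    float* h_a = new float[n];
    float* h_b = new float[n];
    for (unsigned long i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device buffers: allocated with cudaMalloc, NOT with new.
    float* d_a = nullptr;
    float* d_b = nullptr;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    // Enough 256-thread blocks to cover all n elements.
    unsigned long blocks = (n + 255) / 256;
    addArrays<<<blocks, 256>>>(d_a, d_b, n);

    // Check that the launch itself succeeded, then wait for the kernel.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        std::cerr << "launch failed: " << cudaGetErrorString(err) << "\n";
    cudaDeviceSynchronize();

    // Only d_a needs to come back; the kernel never writes to d_b.
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (unsigned long i = 0; i < n; ++i)
        std::cout << h_a[i] << "\n";   // each element should be 3

    cudaFree(d_a);
    cudaFree(d_b);
    delete[] h_a;
    delete[] h_b;
    return 0;
}
```

The key point is that the two pointers in each cudaMemcpy() live in different address spaces: the h_ pointers come from new on the host, the d_ pointers from cudaMalloc() on the device, and the cudaMemcpyHostToDevice / cudaMemcpyDeviceToHost flag tells the runtime which direction the copy goes.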

Upvotes: 2

Related Questions