Nice Nicest

Reputation: 1

Indexing scheme vs. array of pointers in CUDA

I want to add two 2D arrays in CUDA.

When I use an indexing scheme, the program works fine:

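#define COLUMNS 3
#define ROWS 2

// One block per matrix element: blockIdx.x is the column, blockIdx.y is the row.
__global__ void add(int *a, int *b, int *c)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int i = (COLUMNS*y) + x; // row-major flat index
    c[i] = a[i] + b[i];
}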

int main()
{
    int a[ROWS][COLUMNS], b[ROWS][COLUMNS], c[ROWS][COLUMNS];
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **) &dev_a, ROWS*COLUMNS*sizeof(int));
    cudaMalloc((void **) &dev_b, ROWS*COLUMNS*sizeof(int));
    cudaMalloc((void **) &dev_c, ROWS*COLUMNS*sizeof(int));
    for (int y = 0; y < ROWS; y++) // Fill Arrays
        for (int x = 0; x < COLUMNS; x++)
        {
            a[y][x] = x;
            b[y][x] = y;
        }
    cudaMemcpy(dev_a, a, ROWS*COLUMNS*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, ROWS*COLUMNS*sizeof(int), cudaMemcpyHostToDevice);
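    dim3 grid(COLUMNS,ROWS); // COLUMNS x ROWS blocks, one thread each
    add<<<grid,1>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, ROWS*COLUMNS*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a); // release device memory
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;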
}

However, it does not work when the matrix b is represented as an array of pointers rather than by the indexing scheme above:

int a[ROWS][COLUMNS], **b, c[ROWS][COLUMNS];
int *dev_a, *dev_b, *dev_c;
b = (int**)malloc(ROWS*sizeof(int*));
for (int i = 0; i < ROWS; i++)
    b[i] = (int*) malloc(COLUMNS*sizeof(int));

Why?

I used the example from here: http://www.math.uaa.alaska.edu/~afkjm/cs448/handouts/cuda-firstprograms.pdf

Upvotes: 0

Views: 333

Answers (1)

Pixelchemist

Reputation: 24926

Even though the question is old, I will try to give a hint to anyone who arrives here with a related problem.

I agree with leftaroundabout. The compile-time-constant allocation without malloc

int a[ROWS][COLUMNS], b[ROWS][COLUMNS], c[ROWS][COLUMNS];

allocates a single contiguous block of memory for each variable (a, b, c), ROWS*COLUMNS elements in size. That block can be copied to the device in a single memcpy operation.
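The contiguity of such an array can be checked on the host. The following is only a minimal sketch in plain C++ (no CUDA calls): flat_index and is_row_major_contiguous are hypothetical helper names introduced here for illustration, and the formula is the same i = (COLUMNS*y) + x used by the kernel in the question.

```cpp
#include <cassert>
#include <cstddef>

#define COLUMNS 3
#define ROWS 2

// Same formula as the kernel in the question: i = (COLUMNS*y) + x.
// (hypothetical helper, not part of the original code)
inline std::size_t flat_index(std::size_t y, std::size_t x)
{
    assert(y < ROWS && x < COLUMNS); // stay inside the matrix
    return COLUMNS * y + x;
}

// Returns true if every a[y][x] sits exactly flat_index(y, x) ints after
// the start of the array, i.e. the array is one row-major block.
// (hypothetical helper, not part of the original code)
bool is_row_major_contiguous(const int (&a)[ROWS][COLUMNS])
{
    const char *base = reinterpret_cast<const char *>(&a);
    for (std::size_t y = 0; y < ROWS; ++y)
        for (std::size_t x = 0; x < COLUMNS; ++x) {
            const char *elem = reinterpret_cast<const char *>(&a[y][x]);
            if (static_cast<std::size_t>(elem - base) != flat_index(y, x) * sizeof(int))
                return false;
        }
    return true;
}
```

For an array declared as int a[ROWS][COLUMNS], is_row_major_contiguous(a) should hold, which is why one copy of ROWS*COLUMNS*sizeof(int) bytes is enough for it.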

In the pointer-to-pointer case each row is allocated separately, so we are not guaranteed to have one contiguous block of memory that can be copied to the device with a single memcpy. If the cudaMemcpy(dev_b, b, ...) call from the first version is kept unchanged, it copies the host row pointers stored in b rather than the integers they point to.

In the given **b case, one has to copy the data of each row separately to achieve the analogous behaviour:

for (int o=0; o<ROWS; ++o)
{
  cudaMemcpy((dev_b+o*COLUMNS), b[o], COLUMNS*sizeof(int), cudaMemcpyHostToDevice);
}

Nevertheless, I think one should handle a and b using the same scheme to avoid index confusion in case the elements are ordered differently. (An a[ROWS][COLUMNS] array is stored in row-major order in C and C++, which matches the indexing used in the kernel.)
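One way to keep the b[y][x] syntax and still use a single copy is to back the row pointers with one contiguous block. This is only a host-side sketch under stated assumptions: alloc_contiguous_2d, free_contiguous_2d and copy_flat are hypothetical helpers introduced here, and std::memcpy stands in for the cudaMemcpy call so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

#define COLUMNS 3
#define ROWS 2

// Hypothetical helper: allocate rows*cols ints as ONE block and return
// row pointers into that block, so m[y][x] still works.
int **alloc_contiguous_2d(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    const std::size_t count = static_cast<std::size_t>(rows) * cols;
    int **rowptr = static_cast<int **>(std::malloc(rows * sizeof(int *)));
    int *block = static_cast<int *>(std::malloc(count * sizeof(int)));
    if (!rowptr || !block) { // clean up on allocation failure
        std::free(rowptr);
        std::free(block);
        return nullptr;
    }
    for (int y = 0; y < rows; ++y)
        rowptr[y] = block + static_cast<std::size_t>(y) * cols;
    return rowptr;
}

// Hypothetical helper: m[0] is the start of the data block.
void free_contiguous_2d(int **m)
{
    if (m) {
        std::free(m[0]);
        std::free(m);
    }
}

// Stand-in for the single host-to-device copy (assumption: memcpy here,
// cudaMemcpy in the real program). Note the source is src[0], not src.
void copy_flat(int *dst, int **src, int rows, int cols)
{
    std::memcpy(dst, src[0], static_cast<std::size_t>(rows) * cols * sizeof(int));
}
```

In the CUDA program the copy would then presumably become cudaMemcpy(dev_b, b[0], ROWS*COLUMNS*sizeof(int), cudaMemcpyHostToDevice), passing b[0] (the data block) rather than b (the pointer table).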

Upvotes: 1
