ericmoraess

Reputation: 374

How to compute grouped sums in CUDA

I would like to know what techniques I could apply to sum along one dimension of an array and store the result in a new, smaller vector, as in the following example:

A -> [1,2], [3,4], [5,6]

B -> [3], [7], [11]

figure: http://snag.gy/83Qwl.jpg

Upvotes: 0

Views: 184

Answers (1)

Robert Crovella

Reputation: 151889

If you want to write your own CUDA kernel, take a look at the vector add sample. Instead of passing 2 input vectors to the kernel, you would pass just A and provide a loop that sums the elements of each "row" of A, with one thread per row:

__global__ void mykernel(int *A, int *B, int rows, int cols){
  int idx=threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < rows) {
    int sum = 0;
    for (int i=0; i< cols; i++)
      sum += A[(idx*cols)+i];
    B[idx] = sum;
  }
}

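This won't be terribly efficient, because adjacent threads read elements that are cols apart in memory, so the global loads are not coalesced. You can improve the efficiency if you can store your A array in column-major order: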

A -> [1,3,5], [2,4,6] 

Then a small modification to the above kernel becomes pretty efficient, since adjacent threads now read adjacent elements:

__global__ void mykernel(int *A, int *B, int rows, int cols){
  int idx=threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < rows) {
    int sum = 0;
    // column-major: element (row idx, column i) is stored at i*rows+idx
    for (int i=0; i< cols; i++)
      sum += A[(i*rows)+idx];
    B[idx] = sum;
  }
}
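
For completeness, here is a minimal host-side sketch of how the column-major kernel might be launched. The helper name row_sums_col_major, the kernel name row_sum_col_major, and the block size of 256 are my own choices for illustration, not anything required by CUDA. It assumes rows > 0 and that A is stored column-major, so element (row, column i) sits at i*rows+row. Error checking on the runtime calls is left out for brevity.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// One thread per row. A is assumed column-major:
// element (row idx, column i) is stored at A[i*rows + idx].
__global__ void row_sum_col_major(const int *A, int *B, int rows, int cols) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < rows) {
    int sum = 0;
    for (int i = 0; i < cols; i++)
      sum += A[(i * rows) + idx];
    B[idx] = sum;
  }
}

// Hypothetical host helper: copies A to the device, launches the kernel,
// and returns the per-row sums. Runtime error checks are omitted.
std::vector<int> row_sums_col_major(const std::vector<int> &h_A, int rows, int cols) {
  assert(h_A.size() == static_cast<std::size_t>(rows) * cols);

  int *d_A = nullptr;
  int *d_B = nullptr;
  cudaMalloc(&d_A, h_A.size() * sizeof(int));
  cudaMalloc(&d_B, rows * sizeof(int));
  cudaMemcpy(d_A, h_A.data(), h_A.size() * sizeof(int), cudaMemcpyHostToDevice);

  const int threads = 256;                            // assumed block size
  const int blocks = (rows + threads - 1) / threads;  // round up to cover all rows
  row_sum_col_major<<<blocks, threads>>>(d_A, d_B, rows, cols);

  // cudaMemcpy on the default stream waits for the kernel to finish.
  std::vector<int> h_B(rows);
  cudaMemcpy(h_B.data(), d_B, rows * sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(d_A);
  cudaFree(d_B);
  return h_B;
}
```

With the example data, you would call row_sums_col_major({1, 3, 5, 2, 4, 6}, 3, 2), passing A in its column-major layout.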

If you're looking for efficiency but can't reorganize your data, then a segmented parallel reduction will be fastest. You can try creating something based on the CUDA sample codes, but I would suggest using Thrust, specifically reduce_by_key.

You would leave your A array as is and use it as the "values":

A -> [1,2], [3,4], [5,6] 

And you would create a "key" array which corresponds to the rows of your A array:

K -> [0,0], [1,1], [2,2]
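
Putting the values and keys together, a minimal sketch of the reduce_by_key approach might look like the following. The helper name row_sums_by_key is hypothetical, and building the key array on the host with i / cols is just one simple way to do it, assuming a row-major A and cols > 0. Thrust's fancy iterators could generate the keys on the fly instead of storing them, but that is not shown here.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Hypothetical helper: sums each row of a row-major A using thrust::reduce_by_key.
std::vector<int> row_sums_by_key(const std::vector<int> &h_A, int rows, int cols) {
  assert(h_A.size() == static_cast<std::size_t>(rows) * cols);

  // Key for each element is its row number: [0,0], [1,1], [2,2] for the example.
  std::vector<int> h_K(h_A.size());
  for (std::size_t i = 0; i < h_K.size(); i++)
    h_K[i] = static_cast<int>(i / cols);

  thrust::device_vector<int> d_A(h_A.begin(), h_A.end());  // values
  thrust::device_vector<int> d_K(h_K.begin(), h_K.end());  // keys
  thrust::device_vector<int> d_keys_out(rows);             // one key per row
  thrust::device_vector<int> d_B(rows);                    // one sum per row

  // Consecutive elements with equal keys are summed (default: equal_to and plus).
  thrust::reduce_by_key(d_K.begin(), d_K.end(), d_A.begin(),
                        d_keys_out.begin(), d_B.begin());

  std::vector<int> h_B(rows);
  thrust::copy(d_B.begin(), d_B.end(), h_B.begin());
  return h_B;
}
```

Here you would call row_sums_by_key({1, 2, 3, 4, 5, 6}, 3, 2) with A left in its original row-major layout.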

Upvotes: 1
