milad ranaei siadat

Reputation: 23

Sum vector values with CUDA C++

I am trying to sum the values of many vectors using CUDA C++. I found a solution for two vectors. As you can see, it can only add two vectors, but I want to generate the vectors dynamically, all with the same length.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print the result divided by n; this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

Is there a way to do this for many vectors? My vector sizes are:

# vector length
N = 1000
# number of vectors
i = 300000
v[i] = [1, 2, ..., N]

As a result, I need to get:

out[i]= [sum(v[1]), sum(v[2]),..., sum(v[i])]

Thanks for any advice.

Upvotes: 1

Views: 4311

Answers (1)

Robert Crovella

Reputation: 152279

Summing multiple vectors together in a fashion similar to the code you have shown (i.e. generating elementwise sums) is equivalent to summing the columns of a matrix, and that idea is a sensible way to structure the solution.

We will treat your vectors as a matrix, where each vector is a row of the matrix. The CUDA kernel assigns one thread to each column; that thread sums the elements of its column, producing a single number. That number becomes one element of the result vector for the entire problem.

Here is a fully worked example demonstrating one possible approach:

$ cat t2.cu
#include <iostream>

typedef double mt;
const int nTPB = 64;

template <typename T>
__global__ void column_sum(T *matrix, T *sums, unsigned n_vectors, unsigned vector_length){

  unsigned idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < vector_length){
    T temp = 0;
    for (unsigned i = 0; i < n_vectors; i++)
      temp += matrix[i*vector_length+idx];
    sums[idx] = temp;}
}

int main(){
  const unsigned vlen = 1000;
  const unsigned nvec = 300000;
  mt *h_matrix, *d_matrix, *h_sums, *d_sums;
  // create the desired number of vectors as a single matrix
  h_sums = new mt[vlen];
  h_matrix = new mt[vlen*nvec];
  cudaMalloc(&d_matrix, vlen*nvec*sizeof(mt));
  cudaMalloc(&d_sums, vlen*sizeof(mt));
  size_t count = 0;
  for (unsigned i = 0; i < nvec; i++)
    for (unsigned j = 0; j < vlen; j++)
      h_matrix[count++] = j;
  cudaMemcpy(d_matrix, h_matrix, vlen*nvec*sizeof(mt), cudaMemcpyHostToDevice);
  column_sum<<<(vlen+nTPB-1)/nTPB,nTPB>>>(d_matrix, d_sums, nvec, vlen);
  cudaMemcpy(h_sums, d_sums, vlen*sizeof(mt), cudaMemcpyDeviceToHost);
  for (unsigned i = 0; i < vlen; i++)
    if (h_sums[i] != ((mt)nvec)*i) {
      std::cout << " mismatch at " << i << " was: " << h_sums[i] << " should be: " << ((mt)nvec)*i << std::endl;
      return -1;}
  std::cout << cudaGetErrorString(cudaGetLastError()) << std::endl;
}
$ nvcc -o t2 t2.cu
$ cuda-memcheck ./t2
========= CUDA-MEMCHECK
no error
========= ERROR SUMMARY: 0 errors
$

Note that this methodology creates only as many threads on the GPU as there are vector elements (1000 in the above example). 1000 threads is enough to keep only the smallest GPUs busy; larger GPUs will be underutilized at this problem size. However, this algorithm will be efficient on most GPUs if your vector length is 10,000 or longer. If you'd like to explore more efficient algorithms for small problem sizes, you can study the idea of a classical parallel reduction; a brief sketch follows.
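
For reference, here is a minimal sketch of that classical parallel reduction idea, shown reducing a single vector to one number with a shared-memory tree reduction. It is not part of the code above: the kernel name block_reduce, the block size RTPB, and the host-side accumulation of per-block partial sums are illustrative choices for this sketch, which assumes a power-of-two block size.

#include <iostream>

const int RTPB = 256; // threads per block; must be a power of two for the loop below

// Classical shared-memory tree reduction: each block reduces its slice of
// the input to one partial sum; the partial sums are added on the host.
__global__ void block_reduce(const double *in, double *partial, unsigned n){
  __shared__ double sdata[RTPB];
  unsigned tid = threadIdx.x;
  unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
  sdata[tid] = (idx < n) ? in[idx] : 0.0; // pad out-of-range threads with 0
  __syncthreads();
  // halve the number of active threads at each step
  for (unsigned s = blockDim.x/2; s > 0; s >>= 1){
    if (tid < s) sdata[tid] += sdata[tid + s];
    __syncthreads();
  }
  if (tid == 0) partial[blockIdx.x] = sdata[0];
}

int main(){
  const unsigned n = 1000;
  const unsigned nblocks = (n + RTPB - 1)/RTPB;
  double *h_in = new double[n], *h_partial = new double[nblocks];
  for (unsigned i = 0; i < n; i++) h_in[i] = 1.0; // expected total: n
  double *d_in, *d_partial;
  cudaMalloc(&d_in, n*sizeof(double));
  cudaMalloc(&d_partial, nblocks*sizeof(double));
  cudaMemcpy(d_in, h_in, n*sizeof(double), cudaMemcpyHostToDevice);
  block_reduce<<<nblocks, RTPB>>>(d_in, d_partial, n);
  cudaMemcpy(h_partial, d_partial, nblocks*sizeof(double), cudaMemcpyDeviceToHost);
  double sum = 0;
  for (unsigned i = 0; i < nblocks; i++) sum += h_partial[i];
  std::cout << "sum: " << sum << " (expected " << (double)n << ")" << std::endl;
  cudaFree(d_in); cudaFree(d_partial);
  delete [] h_in; delete [] h_partial;
}

Each block writes one partial sum, and the short host loop at the end adds them together; a second kernel launch or atomicAdd could replace that host loop, but this keeps the sketch simple.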

Upvotes: 2
