rex

Reputation: 3183

How to implement a basic C++ 2D array loop in CUDA

I'm pretty new to C++ coding and I'm currently trying to use CUDA to do some GPU computations.

Basically I have a matrix A (N by N), and a couple of vectors b and x0. b and x0 have N elements as well.

This is the piece of code I'm trying to implement:

for (unsigned i = 0; i < N; i++){
    T sum = 0;
    for (unsigned j = 0; j < N; j++){
        sum += A[j][i] * x0[j];   // column access: effectively A-transpose times x0
    }
    v[i] = b[i] - sum;
}

where T is a template type parameter (in my case typically double, as far as I'm aware).

Would it be possible to parallelize the whole thing, and if so, how would I do that? I could also use some pointers on how to break the threads up into blocks for a problem like this, and on how to move the 2D array from host to device and back...

Please let me know if any additional information is required.

EDIT 1: After looking into CUBLAS and not getting very far, I've decided to flatten my matrices and write the code myself. My first discovery has been that my CUDA kernel doesn't seem to work with double-type variables/arrays [can someone confirm this?].

After converting everything to float, the CUDA kernel I've written looks something like this:

__global__ void cudaMatTimesVect(float *p, float *x0, float *v, float *sum, float *toSum, float *b, int N){

    int idx = blockIdx.x * blockDim.x + threadIdx.x; // global thread index

    // each thread computes one elementwise product
    if (idx < N*N){
        toSum[idx] = p[idx] * x0[blockIdx.x];
    }

    __syncthreads();

    // first thread of each block accumulates that block's products
    if (idx - (blockIdx.x * blockDim.x) == 0){
        for (int i = 0; i < blockDim.x; i++){
            sum[blockIdx.x] += toSum[idx + i];
        }
        v[blockIdx.x] = b[blockIdx.x] - sum[blockIdx.x];
    }
}

I'm not sure whether the __syncthreads() call will wait for all the threads to finish multiplying before the sum loop is attempted.

Here is the snippet of CPU code for the sum and toSum arrays, which are allocated only on the GPU:

float *d_sum;
float *d_toSum;
cudaError_t  cudaStatus;
...
// allocate toSum memory
cudaStatus = cudaMalloc(&d_toSum, N*N*sizeof(float));
if (cudaStatus != cudaSuccess){
    std::cout << "couldnt allocate device memory for d_toSum!" << std::endl;
    cudaFree(d_toSum);
}
// allocate sum mem on device
cudaStatus = cudaMalloc(&d_sum, N*sizeof(float));
if (cudaStatus != cudaSuccess){
    std::cout << "couldnt allocate device memory for d_sum" << std::endl;
    cudaFree(d_sum);
}

... 
...
// call the kernel
cudaMatTimesVect<<<N,N>>>(d_p, d_x0, d_v, d_sum, d_toSum, d_b, N);
...


cudaFree(d_toSum);
cudaFree(d_sum);

Is this the most efficient way to do the summation?

EDIT 2: I have now changed the code to use different block indices to run the row computations. The kernel above compiles and runs, but the array elements in v seem to keep getting smaller and smaller rather than resetting...

I am still interested to understand why I can't use doubles, and how my code would need to change if I want to define my host arrays using std::vector.

Thanks,

Armen

Upvotes: 0

Views: 2237

Answers (1)

Robert Crovella

Reputation: 151809

You can solve this problem in cublas:

Data is copied to the GPU using cublasSetVector or cublasSetMatrix.

Results are copied back using the corresponding Get functions.

The matrix-vector multiply is handled with gemv. The vector-vector subtract is handled with axpy.

A worked cublas example is available in the cuda samples.
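Putting those pieces together, here is a minimal sketch (my own illustration, not the samples' code). It assumes float data, a flattened row-major N x N host matrix, and placeholder names h_A, h_b, h_x0, h_v; error checking is omitted. Because cublas assumes column-major storage, the row-major copy of A is seen on the device as A-transpose, which is exactly the sum += A[j][i]*x0[j] access in your serial loop.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// v = b - A^T * x0 for a flattened row-major N x N matrix h_A (names are placeholders)
void matTimesVect(const float *h_A, const float *h_b, const float *h_x0, float *h_v, int N)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *d_A, *d_x0, *d_tmp, *d_v;
    cudaMalloc(&d_A,   N * N * sizeof(float));
    cudaMalloc(&d_x0,  N * sizeof(float));
    cudaMalloc(&d_tmp, N * sizeof(float));
    cudaMalloc(&d_v,   N * sizeof(float));

    // host -> device
    cublasSetMatrix(N, N, sizeof(float), h_A, N, d_A, N);
    cublasSetVector(N, sizeof(float), h_x0, 1, d_x0, 1);
    cublasSetVector(N, sizeof(float), h_b, 1, d_v, 1);    // d_v starts out as b

    // tmp = A^T * x0 (OP_N, because the column-major view of row-major data is already the transpose)
    const float one = 1.0f, zero = 0.0f, minusOne = -1.0f;
    cublasSgemv(handle, CUBLAS_OP_N, N, N, &one, d_A, N, d_x0, 1, &zero, d_tmp, 1);

    // v = b - tmp (axpy computes y = alpha*x + y)
    cublasSaxpy(handle, N, &minusOne, d_tmp, 1, d_v, 1);

    // device -> host
    cublasGetVector(N, sizeof(float), d_v, 1, h_v, 1);

    cudaFree(d_A); cudaFree(d_x0); cudaFree(d_tmp); cudaFree(d_v);
    cublasDestroy(handle);
}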

Based on the additional comments: There's no reason to segment the data into 1D blocks for this problem. I'm recommending cublas. But if you want to see other code examples, take a look at the vector add example and the matrix multiply example.

For the doubly-subscripted matrix on the host, you should flatten that so that you can reference the data with a single (*) pointer and indexing. This is true whether you are using cublas or writing your own code.
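For example, the serial loop from the question, rewritten against a flattened array (row-major storage assumed):

// A[j][i] in the doubly-subscripted version becomes A[j*N + i]
for (unsigned i = 0; i < N; i++){
    float sum = 0;
    for (unsigned j = 0; j < N; j++){
        sum += A[j*N + i] * x0[j];
    }
    v[i] = b[i] - sum;
}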

EDIT: responding to updates in the question. The multiplication code you've posted doesn't look like matrix-vector multiplication to me, unless you've duplicated your vector N times in length so that it matches the length of the matrix (NxN). Then it would appear to be correct.

The summing code doesn't look correct, and furthermore, since it doesn't depend on idx in any way, all your threads are doing exactly the same thing. So no parallel benefit there, and we don't normally write GPU code that way.

Your vector subtract code appears to be approximately correct, except again that you seem to be doing a vector subtract across the entire length of the matrix (NxN) when the results of the matrix-vector multiply should only have produced a vector of length N.
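For reference, the usual pattern for this computation is one thread per output element: each thread accumulates its own dot product in a register, so no __syncthreads() and no separate sum or toSum arrays are needed. A minimal sketch (my illustration, assuming a flattened row-major matrix):

__global__ void matTimesVectKernel(const float *A, const float *x0, const float *b, float *v, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N){
        float sum = 0.0f;
        for (int j = 0; j < N; j++){
            sum += A[j*N + i] * x0[j];   // same A[j][i] access as the serial loop
        }
        v[i] = b[i] - sum;
    }
}

// launch with one thread per output element, e.g.:
// matTimesVectKernel<<<(N + 255)/256, 256>>>(d_p, d_x0, d_b, d_v, N);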

I'd be surprised if this code could produce results that matched your serial code for the same data sets. Did you check that it produced correct results for non-trivial data sets? (Don't use data sets where every number is the same.)
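On the std::vector question: std::vector storage is contiguous, so you can keep the host arrays as vectors and pass .data() (or &vec[0]) wherever a raw pointer is expected; nothing on the device side changes. A brief sketch:

#include <vector>

std::vector<float> h_p(N * N);   // flattened host matrix
// ... fill h_p ...
cudaMemcpy(d_p, h_p.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);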

Upvotes: 1
