Scabro93

Reputation: 37

Better timing in C++ than in CUDA

I'm having trouble with my program in CUDA. The program is an encryption that multiplies a matrix by a vector and produces a result that depends on the input vector. I timed the program in both C++ and CUDA, and C++ comes out faster than CUDA. Because I need several keys for the encryption, I put the work in a loop; the code is as follows:

t1 = clock();
do {

    // copy the current key matrix and input vector to the device
    HANDLE_ERROR ( cudaMemcpy(MAT_dev, MAT, nBytes, cudaMemcpyHostToDevice) );
    HANDLE_ERROR ( cudaMemcpy(VEC_dev, VEC, nBytes, cudaMemcpyHostToDevice) );

    // one block of b threads computes MAT * VEC
    mult<<< 1, b >>>(MAT_dev, VEC_dev, SOL_dev, b);

    // copy the result back to the host
    HANDLE_ERROR ( cudaMemcpy(SOL, SOL_dev, nBytes, cudaMemcpyDeviceToHost) );

    // print the result
    for (i = 0; i < b; i++) {
        cout << SOL[i] << " ";
    }
    cout << endl;

    // feed the result back in as the next input vector
    for (i = 0; i < b; i++) {
        VEC[i] = SOL[i];
    }

    cont = cont + 1;

} while (cont < w);
t2 = clock();
t2 = clock();

My results :

C++ : 11.474 minutes

CUDA : 40.464 minutes

The number of keys were 1,000,000. Matrix 7 x 7 and a Vector 7.

I don't know if this is normal or if I'm missing something that would make it faster.

Thanks for your help.

Upvotes: 1

Views: 154

Answers (1)

kangshiyin

Reputation: 9781

Possible problems with your code:

  1. most of the time is probably spent in cudaMemcpy() and cout <<, not in the kernel;
  2. speed may be limited by the grid/block size. Generally speaking, the number of blocks in a grid should be >= the number of streaming multiprocessors to fully utilize the GPU hardware; the number of threads in a block should be at least 64 and always a multiple of the warp size;
  3. a 7 x 7 matrix and a 7-element vector are too small to achieve good scalability.
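To confirm where the time goes, you can time the copies and the kernel separately with CUDA events. A minimal sketch for one iteration, reusing MAT_dev, VEC_dev, SOL_dev, b, and nBytes from the question:

```cuda
// Time one iteration's host-to-device copies and the kernel separately.
cudaEvent_t start, afterCopy, afterKernel;
cudaEventCreate(&start);
cudaEventCreate(&afterCopy);
cudaEventCreate(&afterKernel);

cudaEventRecord(start);
cudaMemcpy(MAT_dev, MAT, nBytes, cudaMemcpyHostToDevice);
cudaMemcpy(VEC_dev, VEC, nBytes, cudaMemcpyHostToDevice);
cudaEventRecord(afterCopy);

mult<<< 1, b >>>(MAT_dev, VEC_dev, SOL_dev, b);
cudaEventRecord(afterKernel);
cudaEventSynchronize(afterKernel);

float copyMs, kernelMs;
cudaEventElapsedTime(&copyMs, start, afterCopy);    // time spent in cudaMemcpy()
cudaEventElapsedTime(&kernelMs, afterCopy, afterKernel); // time spent in the kernel
printf("copy: %f ms, kernel: %f ms\n", copyMs, kernelMs);
```

For a 7 x 7 problem you will likely find the copy (and the cout << printing on the host) dominates the kernel time by a wide margin.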

Possible solutions:

  1. Instead of doing 1,000,000 m_{7x7} * v_{7} products, try to do one m_{7,000,000x7} * v_{7} product;
  2. try to merge the 1,000,000 cudaMemcpy() calls into 1;
  3. use cudaMallocPitch() to allocate memory for the small matrices, which relaxes alignment problems;
  4. try the cublas<t>gemv() routines from the cuBLAS library if the element type of your matrix/vector is float or double.
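Suggestion 1 could be sketched as below. This assumes the per-key products are independent; in the question's loop each iteration feeds SOL back into VEC, so that dependency would first have to be removed or restructured. Here the N key matrices are stacked into one (N*7) x 7 array on the device, and one thread computes one output row:

```cuda
// Batched mat-vec: n keys, each a b x b matrix times a b-vector (b = 7).
// mat holds n stacked matrices row-major, vec holds n stacked b-vectors.
__global__ void multBatched(const int *mat, const int *vec, int *sol,
                            int n, int b)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // global output row
    if (row < n * b) {
        int key = row / b;                            // which key's vector
        int acc = 0;
        for (int j = 0; j < b; j++)
            acc += mat[row * b + j] * vec[key * b + j];
        sol[row] = acc;
    }
}

// Launch with enough 256-thread blocks to cover all n*b rows,
// after a single cudaMemcpy of all inputs:
//   multBatched<<< (n * b + 255) / 256, 256 >>>(MAT_dev, VEC_dev, SOL_dev, n, b);
```

This replaces 1,000,000 tiny <<<1, 7>>> launches (and their memcpys) with one launch whose grid is large enough to keep all streaming multiprocessors busy.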

You may wish to read the CUDA C Programming Guide and the CUDA C Best Practices Guide before writing your own kernels.

Upvotes: 3
