Caenorst

Reputation: 97

cuBLAS is unusually slow compared to cuSPARSE

I'm trying to run some tests comparing cuSPARSE and cuBLAS performance at different sparsity levels (on a Titan X). Here is the main code, named "testcusparsevector.cpp":

#include <stdio.h>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <fstream>
#include <time.h>
#include <cuda_runtime.h>
#include <cublas.h>
#include <cusparse_v2.h>
#include <cublas_v2.h>
#include <assert.h>
#define M 6
#define N 5
#define IDX2C(i,j,ld) (((j)*(ld))+(i))


// /home/gpu1/Install/OpenBLAS-0.2.14


#define CHECK_EQ(a,b) do { \
    if ((a) != (b)) { \
        cout <<__FILE__<<" : "<< __LINE__<<" : check failed because "<<a<<"!="<<b<<endl;\
        exit(1);\
    }\
} while(0)

#define CUBLAS_CHECK(condition) \
do {\
    cublasStatus_t status = condition; \
    CHECK_EQ(status, CUBLAS_STATUS_SUCCESS); \
} while(0)

#define CUSPARSE_CHECK(condition)\
do {\
    cusparseStatus_t status = condition; \
    switch(status)\
    {\
        case CUSPARSE_STATUS_NOT_INITIALIZED:\
            cout << "CUSPARSE_STATUS_NOT_INITIALIZED" << endl;\
            break;\
        case CUSPARSE_STATUS_ALLOC_FAILED:\
            cout << "CUSPARSE_STATUS_ALLOC_FAILED" << endl;\
            break;\
        case CUSPARSE_STATUS_INVALID_VALUE:\
            cout << "CUSPARSE_STATUS_INVALID_VALUE" << endl;\
            break;\
        case CUSPARSE_STATUS_ARCH_MISMATCH:\
            cout << "CUSPARSE_STATUS_ARCH_MISMATCH" << endl;\
            break;\
        case CUSPARSE_STATUS_MAPPING_ERROR:\
            cout << "CUSPARSE_STATUS_MAPPING_ERROR" << endl;\
            break;\
        case CUSPARSE_STATUS_EXECUTION_FAILED:\
            cout << "CUSPARSE_STATUS_EXECUTION_FAILED" << endl;\
            break;\
        case CUSPARSE_STATUS_INTERNAL_ERROR:\
            cout << "CUSPARSE_STATUS_INTERNAL_ERROR" << endl;\
            break;\
        case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:\
            cout << "CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED" << endl;\
            break;\
        case CUSPARSE_STATUS_ZERO_PIVOT:\
            cout << "CUSPARSE_STATUS_ZERO_PIVOT" << endl;\
    }\
    CHECK_EQ(status, CUSPARSE_STATUS_SUCCESS); \
} while(0)

#define CUDA_CHECK(condition)\
do {\
    cudaError_t error = condition;\
    CHECK_EQ(error, cudaSuccess);\
} while(0)

//check after kernel function
#define CUDA_POST_KERNEL_CHECK CUDA_CHECK(cudaPeekAtLastError())



#define __TIMING__ 1

#if __TIMING__


#define INIT_TIMER  cudaEvent_t start, stop; \
    float milliseconds = 0; \
    float sum = 0;\
    cudaEventCreate( &start );\
    cudaEventCreate( &stop );

#define TIC {  cudaEventRecord( start ); }

#if __CUDNN__
    #define PREDEFNAME "CUDNN"
#else
    #define PREDEFNAME "CUDA"
#endif

#define TOC(a) { cudaEventRecord( stop ); \
        cudaEventSynchronize( stop ); \
        cudaEventElapsedTime( &milliseconds, start, stop );  \
        printf( "GPU Execution time of %s_%s: %f ms\n",PREDEFNAME, a, milliseconds ); \
        sum += milliseconds;\
        fflush(stdout); }

#define CLOSE_TIMER {cudaEventDestroy(start); cudaEventDestroy(stop); }
#endif

using namespace std;

void dispArray(double* array, size_t width, size_t height) {
    for (int i=0; i < height;i++ ) {
        for (int j=0;j < width;j++) {
            cout << array[j*height+i] << ' ';
        }
        cout << endl;
    }
    cout << endl;
}

int main()
{
    srand(time(NULL));
    const int num_loop = 1;
    const int inside_loop = 1000;
    // const int WIDTH = 512*3*3;
    // const int HEIGHT = 512;
    // const int WIDTHOUT = 36;
    const int WIDTH = 4608;
    const int HEIGHT = 512;
    const int WIDTHOUT = 144;
    // const int WIDTH = 18500;
    // const int HEIGHT = 512;
    // const int WIDTHOUT = 1;
    // const int WIDTH = 3;
    // const int HEIGHT = 5;
    // const int WIDTHOUT = 2;
    INIT_TIMER
    ofstream myfile;
    myfile.open("test_sparsity.log");

    cudaError_t cudaStat;    
    cusparseStatus_t stat;
    cusparseHandle_t handle;
    cublasHandle_t handleblas;

    double *devPtrOutput;
    double *devPtrOutput2;
    double *devPtrRand;
    double *devPtrSec;
    CUDA_CHECK(cudaMalloc((void **)&(devPtrOutput), sizeof(double)*HEIGHT*WIDTHOUT));
    CUDA_CHECK(cudaMalloc((void **)&(devPtrOutput2), sizeof(double)*HEIGHT*WIDTHOUT));

    CUDA_CHECK(cudaMalloc((void **)&(devPtrRand), sizeof(double)*WIDTH*WIDTHOUT));
    CUDA_CHECK(cudaMalloc((void **)&(devPtrSec), sizeof(double)*WIDTH*HEIGHT));
    const double alpha=1.0;
    const double beta=0.0;
    double *csrVal;
    int *csrRowPtr;
    int *csrColInd;

    const bool SPARSE = true;
    long a = clock();
    long temp = clock();
    cusparseMatDescr_t descr;
    CUSPARSE_CHECK(cusparseCreateMatDescr(&descr));
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);
    int nnz;
    CUSPARSE_CHECK(cusparseCreate(&handle));
    CUBLAS_CHECK(cublasCreate(&handleblas));
    int *nnzPerRow_gpu;
    CUDA_CHECK(cudaMalloc((void **)&(nnzPerRow_gpu), sizeof(int)*HEIGHT));
    CUDA_CHECK(cudaMalloc((void **)&(csrRowPtr), sizeof(int)*(HEIGHT+1)));
    double density_array[1] = {0.9999};//, 0.8, 0.7, 0.6, 0.5,      0.4, 0.3, 0.2, 0.1 ,0.09,     0.08, 0.07, 0.06, 0.05 ,0.04,     0.03, 0.02, 0.01};
    for (int inddense=0;inddense < 1;inddense++) {
        double DENSITY = density_array[inddense];
        int num_non_zeros = DENSITY * (WIDTH * HEIGHT);

        CUDA_CHECK(cudaMalloc((void **)&(csrColInd), sizeof(int)*num_non_zeros));
        CUDA_CHECK(cudaMalloc((void **)&(csrVal), sizeof(double)*num_non_zeros));
        INIT_TIMER
        for (int iter=0; iter < num_loop;iter++) {
            vector<double> randVec(WIDTH*WIDTHOUT, 0);
            vector<double> secArray(WIDTH*HEIGHT, 0);
            vector<int> temp(WIDTH*HEIGHT, 1);

            for (int j = 0; j < WIDTH*WIDTHOUT; j++) {
                randVec[j]=(double)(rand()%100000)/100;
            }

            for (int x, i = 0; i < num_non_zeros;i++) {
                do
                {
                    x = rand() % (WIDTH*HEIGHT);
                } while(temp[x] == 0);
                temp[x]=0;
                secArray[x]=(double)(rand()%100000)/100;
            }
            int count = 0;
            for(int i=0;i < WIDTH*HEIGHT;i++) {
                if (secArray[i] != 0) {
                    count++;
                }
            }

            // randVec = {2,2,2,3,3,3};
            // secArray = {0,5,0,2,5,8,7,0,0,0,0,2,0,4,4};
            CUDA_CHECK(cudaMemcpy(devPtrRand, &randVec[0], sizeof(double)*WIDTH*WIDTHOUT, cudaMemcpyHostToDevice));
            CUDA_CHECK(cudaMemcpy(devPtrSec, &secArray[0], sizeof(double)*WIDTH*HEIGHT, cudaMemcpyHostToDevice));


            if (SPARSE) {
                CUSPARSE_CHECK(cusparseDnnz(handle, CUSPARSE_DIRECTION_ROW, HEIGHT, WIDTH, descr, devPtrSec, HEIGHT, nnzPerRow_gpu, &nnz));
                CUSPARSE_CHECK(cusparseDdense2csr(handle, HEIGHT, WIDTH, descr,devPtrSec,HEIGHT,nnzPerRow_gpu,csrVal,csrRowPtr,csrColInd));
            }       
            // vector<double> tempcsrVal(nnz,0);
            // vector<int> tempcsrRowPtr(HEIGHT+1);
            // vector<int> tempcsrColInd(nnz,0);
            // CUDA_CHECK(cudaMemcpy(&tempcsrVal[0], csrVal, sizeof(double)*nnz, cudaMemcpyDeviceToHost));
            // CUDA_CHECK(cudaMemcpy(&tempcsrRowPtr[0], csrRowPtr, sizeof(int)*(HEIGHT+1), cudaMemcpyDeviceToHost));
            // CUDA_CHECK(cudaMemcpy(&tempcsrColInd[0], csrColInd, sizeof(int)*nnz, cudaMemcpyDeviceToHost));
            // for (int i =0; i < nnz;i++) {
                // cout << tempcsrVal[i] << " ";
            // }
            // cout << endl;
            // for (int i =0; i < HEIGHT+1;i++) {
                // cout << tempcsrRowPtr[i] << " ";
            // }
            // cout << endl;
            // for (int i =0; i < nnz;i++) {
                // cout << tempcsrColInd[i] << " ";
            // }
            // cout << endl;
            cudaDeviceSynchronize();
            TIC
            for (int i=0 ; i < inside_loop;i++) {
                if (WIDTHOUT == 1) {
                    // TIC
                    CUSPARSE_CHECK(cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                    HEIGHT, WIDTH, nnz, &alpha, descr, csrVal, csrRowPtr, csrColInd, 
                    devPtrRand, &beta, devPtrOutput));
                    // TOC("csrmv")
                } else {
                    // TIC
                    CUSPARSE_CHECK(cusparseDcsrmm(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, 
                        HEIGHT, WIDTHOUT, WIDTH, nnz, &alpha, descr, csrVal, csrRowPtr, 
                        csrColInd, devPtrRand, WIDTH, &beta, devPtrOutput, HEIGHT));
                    // TOC("csrmm")
                }
            }
            TOC("csr")
            TIC
            for (int i=0 ; i < inside_loop;i++) {
                if (WIDTHOUT == 1) {
                    // TIC
                    CUBLAS_CHECK(cublasDgemv(handleblas, CUBLAS_OP_N, HEIGHT, WIDTH, &alpha, devPtrSec, HEIGHT , devPtrRand, 1, &beta, devPtrOutput2, 1));
                    // TOC("dgemv")
                } else {
                    // TIC
                    CUBLAS_CHECK(cublasDgemm(handleblas, CUBLAS_OP_N, CUBLAS_OP_N, HEIGHT, WIDTHOUT, WIDTH, &alpha, devPtrSec, HEIGHT, devPtrRand, WIDTH, &beta, devPtrOutput2, HEIGHT));
                    // TOC("dgemm")
                }
            }
            TOC("blas")


            #if 0
            vector<double> output(HEIGHT*WIDTHOUT, 0);
            vector<double> output2(HEIGHT*WIDTHOUT, 0);
            CUDA_CHECK(cudaMemcpy(&output[0], devPtrOutput, sizeof(double)*HEIGHT*WIDTHOUT, cudaMemcpyDeviceToHost));
            CUDA_CHECK(cudaMemcpy(&output2[0], devPtrOutput2, sizeof(double)*HEIGHT*WIDTHOUT, cudaMemcpyDeviceToHost));
            dispArray(&output[0], WIDTHOUT, HEIGHT);
            cout << endl;
            for (int i=0;i < WIDTHOUT * HEIGHT;i++) {
                if (output[i] != output2[i]) {
                    cout << "error: " << i << " " << (output[i] - output2[i]) << " " << output[i] << endl;
                }
            }
            #endif

        }

        cout << DENSITY << " " << sum/num_loop << endl;
        myfile << DENSITY << " " << sum/num_loop << endl;
        cudaFree(csrColInd);
        cudaFree(csrVal);
    }
    myfile.close();
    cudaFree(csrRowPtr);
    cudaFree(nnzPerRow_gpu);
    cudaFree(devPtrOutput);
    cudaFree(devPtrOutput2);
    cudaFree(devPtrRand);
    cudaFree(devPtrSec);

}

However, after compiling the code with

g++ -std=c++1y -O3 -I/usr/local/cuda/include -o testcusparsevector testcusparsevector.cpp -L/usr/local/cuda/lib64 -lcudart -lcublas -lcusparse

here is the output:

GPU Execution time of CUDA_csr: 4818.447266 ms
GPU Execution time of CUDA_blas: 5024.459961 ms

which means that even with a density of 0.999, cusparseDcsrmm is still faster than cublasDgemm. I already checked the results, which are correct, and compared to other examples it seems the problem comes from cuBLAS being far too slow.

Do you have any idea where this comes from?

EDIT: I tried changing the values to float, and the result is more like what I was expecting; apparently, cuBLAS is not made for double computation...
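
Roughly, the substitution just swaps the D-variants of the calls for the S-variants. Here is a condensed, standalone sketch of what I mean (not the benchmark above: the sizes are the same, the matrix is simply left fully dense, and error checking is trimmed to asserts to keep it short):

// Sketch: single-precision version of the same comparison (see assumptions above).
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusparse_v2.h>

int main() {
    const int HEIGHT = 512, WIDTH = 4608, WIDTHOUT = 144;
    const float alpha = 1.0f, beta = 0.0f;

    cublasHandle_t hb;   assert(cublasCreate(&hb) == CUBLAS_STATUS_SUCCESS);
    cusparseHandle_t hs; assert(cusparseCreate(&hs) == CUSPARSE_STATUS_SUCCESS);
    cusparseMatDescr_t descr;
    assert(cusparseCreateMatDescr(&descr) == CUSPARSE_STATUS_SUCCESS);

    // Dense A (HEIGHT x WIDTH, column-major), B (WIDTH x WIDTHOUT), C (HEIGHT x WIDTHOUT)
    std::vector<float> hA((size_t)HEIGHT * WIDTH), hB((size_t)WIDTH * WIDTHOUT);
    for (size_t i = 0; i < hA.size(); i++) hA[i] = (float)(rand() % 100000) / 100.0f;
    for (size_t i = 0; i < hB.size(); i++) hB[i] = (float)(rand() % 100000) / 100.0f;

    float *dA, *dB, *dC, *dC2, *csrVal;
    int *nnzPerRow, *csrRowPtr, *csrColInd, nnz;
    cudaMalloc((void **)&dA, sizeof(float) * hA.size());
    cudaMalloc((void **)&dB, sizeof(float) * hB.size());
    cudaMalloc((void **)&dC,  sizeof(float) * HEIGHT * WIDTHOUT);
    cudaMalloc((void **)&dC2, sizeof(float) * HEIGHT * WIDTHOUT);
    cudaMalloc((void **)&nnzPerRow, sizeof(int) * HEIGHT);
    cudaMalloc((void **)&csrRowPtr, sizeof(int) * (HEIGHT + 1));
    cudaMemcpy(dA, hA.data(), sizeof(float) * hA.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(float) * hB.size(), cudaMemcpyHostToDevice);

    // Dense -> CSR, single precision (the S-variants of the same cuSPARSE calls)
    cusparseSnnz(hs, CUSPARSE_DIRECTION_ROW, HEIGHT, WIDTH, descr, dA, HEIGHT, nnzPerRow, &nnz);
    cudaMalloc((void **)&csrVal, sizeof(float) * nnz);
    cudaMalloc((void **)&csrColInd, sizeof(int) * nnz);
    cusparseSdense2csr(hs, HEIGHT, WIDTH, descr, dA, HEIGHT, nnzPerRow, csrVal, csrRowPtr, csrColInd);

    cudaEvent_t t0, t1; float ms;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0);
    for (int i = 0; i < 1000; i++)
        cusparseScsrmm(hs, CUSPARSE_OPERATION_NON_TRANSPOSE, HEIGHT, WIDTHOUT, WIDTH, nnz,
                       &alpha, descr, csrVal, csrRowPtr, csrColInd, dB, WIDTH, &beta, dC, HEIGHT);
    cudaEventRecord(t1); cudaEventSynchronize(t1); cudaEventElapsedTime(&ms, t0, t1);
    printf("Scsrmm: %f ms\n", ms);

    cudaEventRecord(t0);
    for (int i = 0; i < 1000; i++)
        cublasSgemm(hb, CUBLAS_OP_N, CUBLAS_OP_N, HEIGHT, WIDTHOUT, WIDTH,
                    &alpha, dA, HEIGHT, dB, WIDTH, &beta, dC2, HEIGHT);
    cudaEventRecord(t1); cudaEventSynchronize(t1); cudaEventElapsedTime(&ms, t0, t1);
    printf("Sgemm:  %f ms\n", ms);

    // (cleanup of the device allocations and handles omitted in this sketch)
    return 0;
}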

Thanks in advance.

Upvotes: 3

Views: 488

Answers (1)

Robert Crovella

Reputation: 151799

The Titan X (like all current members of the Maxwell GPU family) has a 1:32 ratio of double-precision to single-precision floating-point throughput.
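
As a quick check, newer CUDA toolkits expose this ratio as a device attribute. A small sketch (assuming a toolkit recent enough to define cudaDevAttrSingleToDoublePrecisionPerfRatio); it prints the runtime's reported single-to-double precision performance ratio for each visible GPU, e.g. 32 on a Maxwell GeForce part and 3 on a Kepler Tesla:

// Sketch: print the SP:DP throughput ratio the CUDA runtime reports per device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        int ratio = 0;
        cudaDeviceGetAttribute(&ratio, cudaDevAttrSingleToDoublePrecisionPerfRatio, dev);
        printf("%s: SP:DP throughput ratio = %d:1\n", prop.name, ratio);
    }
    return 0;
}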

Normally, sparse matrix operations are memory-bandwidth bound, whereas dense matrix-matrix multiply is a classic example of a compute-bound problem.

So in your example, you're taking a problem that is typically compute-bound and running it as a sparse matrix multiply on a processor that has a relatively large amount of memory bandwidth and a relatively small amount of double-precision compute throughput.
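
To put rough numbers on that, here is a back-of-envelope roofline sketch for one DGEMM at your sizes. The FP64 peak and bandwidth figures are approximate published specs used as assumptions for illustration, not measurements, and a real GEMM won't reach 100% of peak:

// Sketch: roofline-style estimate for one DGEMM (m=512, n=144, k=4608).
// Peak FP64 throughput and memory bandwidth below are approximate spec-sheet values.
#include <cstdio>

struct Gpu { const char *name; double dp_tflops; double gbps; };

int main() {
    const double m = 512, n = 144, k = 4608;              // HEIGHT, WIDTHOUT, WIDTH
    const double flops = 2.0 * m * n * k;                 // one DGEMM
    const double bytes = 8.0 * (m * k + k * n + m * n);   // A, B, C each touched once
    const double intensity = flops / bytes;               // ~27 flop/byte

    const Gpu gpus[] = { {"GTX Titan X (Maxwell)", 0.19, 336.0},
                         {"Tesla K40 (base clock)", 1.43, 288.0} };
    for (int i = 0; i < 2; i++) {
        const Gpu &g = gpus[i];
        double ridge = g.dp_tflops * 1e12 / (g.gbps * 1e9);    // flop/byte where compute == bandwidth
        double t_compute = flops / (g.dp_tflops * 1e12) * 1e3; // ms per call if compute-limited
        double t_memory  = bytes / (g.gbps * 1e9) * 1e3;       // ms per call if bandwidth-limited
        printf("%-24s ridge %.2f flop/B, DGEMM %.1f flop/B -> >= %.2f ms compute vs %.3f ms memory per call\n",
               g.name, ridge, intensity, t_compute, t_memory);
    }
    return 0;
}

At roughly 27 flop/byte this DGEMM sits well above the FP64 ridge point of either card, i.e. it stays compute-bound; the difference is that the Titan X has only about 0.19 TFLOP/s of FP64 to spend on it, which is why your 1000-iteration cuBLAS loop lands in the multi-second range there and shrinks dramatically on the K40 shown below.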

This situation can blur the line between the two APIs, whereas normally the CUBLAS API would be much quicker in this comparison.

If you switch your code to use float instead of double, as I think you've already tried, you'll see CUBLAS win again. Likewise, if you ran the code as-is on a GPU with a different ratio between single- and double-precision throughput, you'd see CUBLAS win there as well.

"apparently, cuBLAS is not made for double computation..."

Rather than saying that, I would say that the GTX Titan X is not made (primarily) for double-precision computation. Try a Tesla K80, K40, or another GPU that has a closer ratio of double-precision to single-precision throughput.

Here's the output of your program running on an "unboosted" Tesla K40:

$ ./testcusparsevector
GPU Execution time of CUDA_csr: 8870.386719 ms
GPU Execution time of CUDA_blas: 1045.211792 ms

Disclaimer: I haven't attempted to study your code. I looked it over, and no obvious issues jump out at me. But there could be issues that I haven't spotted.

Upvotes: 3
