Arya Mz

Reputation: 591

Unusual performance gap between OpenCL and CUDA

I have coded a simple tiled matrix multiplication in CUDA. It's like this:

__global__ void matrixMultiplyShared(float * A, float * B, float * C,
                         int numARows, int numAColumns,
                         int numBRows, int numBColumns,
                         int numCRows, int numCColumns) {

    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    int row = by * TILE_WIDTH + ty;
    int col = bx * TILE_WIDTH + tx;

    float Cvalue = 0.0f;

    // Loop over the tiles of A and B required to compute this element of C
    for (int m = 0; m < (numAColumns-1)/TILE_WIDTH+1; ++m) {
        if(row<numARows && m*TILE_WIDTH+tx < numAColumns){
            ds_A[ty][tx] = A[row*numAColumns + m*TILE_WIDTH+tx];
        } else {
            ds_A[ty][tx] = 0;
        }
        if(m*TILE_WIDTH+ty < numBRows && col < numBColumns){
            ds_B[ty][tx] = B[(m*TILE_WIDTH+ty)*numBColumns+col];
        } else {
            ds_B[ty][tx] = 0;
        }
        __syncthreads();
        if(row < numCRows && col < numCColumns){
            for (int k = 0; k < TILE_WIDTH; ++k)
                Cvalue += ds_A[ty][k] * ds_B[k][tx];
        }
        __syncthreads();
    }
    if(row < numCRows && col < numCColumns)
        C[row*numCColumns+col] = Cvalue;
}

After that, I used the same kernel (with some minor changes) in an OpenCL version to compare the performance of CUDA and OpenCL. The result was far beyond my expectations: OpenCL was 6-7 times faster than CUDA. Is that valid? The Nsight output is as follows:

CUDA: CUDA Nsight output: Kernel Ex time: 3.78s

OpenCL: CUDA Nsight output: Kernel Ex time: 0.53s

You can see a large gap between starting the app and executing the kernel. Why does that happen?


My GPU is a GTX 580. Kernel execution time (CUDA): 3.78 s. Kernel execution time (OpenCL): 0.53 s.

CUDA Code: http://pastebin.com/VQMp3Hba

OpenCL Host Code: http://pastebin.com/cjGYSLQf

OpenCL Kernel Code: http://pastebin.com/KKw3Ayz7

Upvotes: 2

Views: 592

Answers (1)

Rahul

Reputation: 11

You can try inserting explicit timers in the code instead of trusting the output from the tool. It may be that the tool is wrong.
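As a rough illustration of what explicit timing could look like on the CUDA side, here is a minimal sketch using CUDA events. The kernel `dummyKernel`, the problem size, and the launch configuration are placeholders I made up for the example; you would substitute your own `matrixMultiplyShared` launch. The warm-up launch is there on the assumption that one-time initialization cost might otherwise end up in the measurement.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel (e.g. matrixMultiplyShared).
__global__ void dummyKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main() {
    const int n = 1 << 20;             // placeholder problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so that one-time setup is not included in the timing.
    dummyKernel<<<blocks, threads>>>(d_out, n);
    cudaDeviceSynchronize();

    // Timed launch: the events are recorded in the same stream as the kernel.
    cudaEventRecord(start);
    dummyKernel<<<blocks, threads>>>(d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("Kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

On the OpenCL side, the comparable approach would be to create the command queue with `CL_QUEUE_PROFILING_ENABLE` and read `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` from the kernel's event with `clGetEventProfilingInfo`. If both measurements still differ by 6-7x, the gap is probably real rather than a tool artifact.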

Upvotes: 1
