user2970139

Reputation: 559

OpenCL: Optimize matrix multiplication for uchar

I adapted the kernel below from one of the NVIDIA OpenCL examples and compared its performance to clblasSgemm; they perform equally fast, at least on my setup. I launch it with a {16, 16} local work size.
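For reference, the host-side launch looks roughly like this (a sketch; launch_mat_mul is a placeholder name, and both global dimensions are assumed to be padded to multiples of 16, as the kernel requires):

#include <CL/cl.h>

// Sketch: enqueue mat_mul over a C_rows x C_cols result with a
// {16, 16} local work size. Error handling is omitted.
cl_int launch_mat_mul(cl_command_queue queue, cl_kernel kernel,
                      size_t C_rows, size_t C_cols) {
  size_t local[2]  = {16, 16};
  size_t global[2] = {C_cols, C_rows};  // dim 0 spans columns, dim 1 rows
  return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                global, local, 0, NULL, NULL);
}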

Now, assume matrices A and B are both uchar, and C accordingly uint. Is there any way to optimize the multiplication? Simply replacing the types degraded performance (a sketch of this substitution follows the kernel below). I also tried hand-vectorizing with uchar4 and uchar16, but that made it slower.

Any suggestions welcome! (I am new to GPU programming and OpenCL)

/*
 * This software contains source code provided by NVIDIA Corporation.
 */

#define BLOCK_SIZE 16

__kernel void mat_mul(const __global float* A, const __global float* B,
                      __global float* C,
                      const int A_cols, const int B_cols) {
  // Block index
  const int bx = get_group_id(0);
  const int by = get_group_id(1);

  // Thread index
  const int tx = get_local_id(0);
  const int ty = get_local_id(1);

  // Index of the first sub-matrix of A processed by the block
  const int a0 = A_cols * BLOCK_SIZE * by;
  // Index of the last sub-matrix of A processed by the block
  const int a1 = a0 + A_cols - 1;
  // Step size used to iterate through the sub-matrices of A
  const int a_step = BLOCK_SIZE;

  // Index of the first sub-matrix of B processed by the block
  const int b0 = BLOCK_SIZE * bx;
  // Step size used to iterate through the sub-matrices of B
  const int b_step = BLOCK_SIZE * B_cols;

  // Csub is used to store the element of the block sub-matrix
  // that is computed by the thread
  float Csub = 0;

  __local float As[BLOCK_SIZE][BLOCK_SIZE];
  __local float Bs[BLOCK_SIZE][BLOCK_SIZE];

  // Loop over all the sub-matrices of A and B required to compute the
  // block sub-matrix
  for (int a = a0, b = b0; a <= a1; a += a_step, b += b_step) {
    // Load the matrices from global memory to local memory;
    // each thread loads one element of each matrix
    As[ty][tx] = A[a + A_cols * ty + tx];
    Bs[ty][tx] = B[b + B_cols * ty + tx];

    // Synchronize to make sure the matrices are loaded
    barrier(CLK_LOCAL_MEM_FENCE);

    // Multiply the two matrices together;
    // each thread computes one element of the block sub-matrix
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k) {
      Csub += As[ty][k] * Bs[k][tx];
    }

    // Synchronize to make sure that the preceding computation is done
    // before loading two new sub-matrices of A and B in the next
    // iteration
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  // Write the block sub-matrix to global memory;
  // each thread writes one element
  C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}
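For reference, by "simply replacing the types" I mean the following variant (a sketch; mat_mul_u8 is just a name for this post):

// Sketch of the plain type substitution: uchar inputs, uint output.
// Structure is identical to mat_mul above; only the types change.
__kernel void mat_mul_u8(const __global uchar* A, const __global uchar* B,
                         __global uint* C,
                         const int A_cols, const int B_cols) {
  const int bx = get_group_id(0);
  const int by = get_group_id(1);
  const int tx = get_local_id(0);
  const int ty = get_local_id(1);

  const int a0 = A_cols * BLOCK_SIZE * by;
  const int a1 = a0 + A_cols - 1;
  const int a_step = BLOCK_SIZE;
  const int b0 = BLOCK_SIZE * bx;
  const int b_step = BLOCK_SIZE * B_cols;

  uint Csub = 0;

  __local uchar As[BLOCK_SIZE][BLOCK_SIZE];
  __local uchar Bs[BLOCK_SIZE][BLOCK_SIZE];

  for (int a = a0, b = b0; a <= a1; a += a_step, b += b_step) {
    As[ty][tx] = A[a + A_cols * ty + tx];
    Bs[ty][tx] = B[b + B_cols * ty + tx];
    barrier(CLK_LOCAL_MEM_FENCE);

    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k) {
      // uchar operands are promoted before the multiply, so the product
      // cannot overflow; the casts just make the widening explicit.
      Csub += (uint)As[ty][k] * (uint)Bs[k][tx];
    }

    barrier(CLK_LOCAL_MEM_FENCE);
  }

  C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}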

Upvotes: 0

Views: 534

Answers (1)

Roman Arzumanyan

Reputation: 1814

There is a very simple way to measure whether your kernel is good: calculate its OPS and bandwidth (how much matrix data you process per second), then compare them to the theoretical limits of your device. The ratio tells you which factor is limiting performance. Usually it's load-store operations.
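For example, a back-of-the-envelope version of that calculation could look like this (a sketch; report_throughput is a made-up helper, elapsed_s is assumed to come from clGetEventProfilingInfo on the kernel's event, and the byte count assumes the uchar/uint variant):

#include <stdio.h>

// Sketch: achieved throughput for an (M x K) * (K x N) multiplication.
void report_throughput(int M, int N, int K, double elapsed_s) {
  // One multiply and one add per inner-product step: 2*M*N*K ops.
  double ops = 2.0 * (double)M * N * K;
  // Minimum traffic: read A and B once (1 byte per uchar element),
  // write C once (4 bytes per uint element).
  double bytes = (double)M * K + (double)K * N + 4.0 * (double)M * N;
  printf("%.2f GOPS, %.2f GB/s effective\n",
         ops / elapsed_s / 1e9, bytes / elapsed_s / 1e9);
}

Compare those two figures against your device's peak compute throughput and peak memory bandwidth; whichever is closer to its limit is the bottleneck.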

Upvotes: 1
