Manolete

Reputation: 3517

CUDA efficient division?

I would like to know if there is, by any chance, an efficient way of dividing elements of an array. I am running with matrix values of 10000x10000 and it takes a considerable amount of time in comparison with other kernels. Division is an expensive operation, and I can't see how to improve it.

__global__ void division(int N, float* A, int* B){

  int row = blockIdx.x * blockDim.x + threadIdx.x;
  int col = blockIdx.y * blockDim.y + threadIdx.y;

  if((row < N) && (col <= row) ){
    if( B[row*N+col] >0 )
      A[row*N+col] /= (float)B[row*N+col];
  }

}

kernel launched with

  int N = 10000;
  int threads = 32;
  int blocks = (N+threads-1)/threads;
  dim3 t(threads,threads);
  dim3 b(blocks, blocks);
  division<<< b, t >>>(N, A, B);
  cudaThreadSynchronize();

Option B:

__global__ void division(int N, float* A, int* B){
  int k =  blockIdx.x * blockDim.x + threadIdx.x;
  int kmax = N*(N+1)/2;
  int row, col;
  if(k< kmax){
    row = (int)(sqrt(0.25+2.0*k)-0.5); 
    col = k - (row*(row+1))>>1;
    if( B[row*N+col] >0 )
      A[row*N+col] /= (float)B[row*N+col];
  }
}

launched with

  int threads = 192;
  int totalThreadsNeeded = N*(N+1)/2;
  int blocks = (totalThreadsNeeded + threads - 1)/threads;
  division<<<blocks, threads >>>(N, A, B);
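To check the index arithmetic, the same k → (row, col) mapping can be replayed on the host (a quick standalone sketch; check_mapping is just an illustrative helper, not part of the kernel):

#include <math.h>
#include <stdio.h>

/* Replays the kernel's k -> (row, col) arithmetic on the CPU and
   reports the first k that falls outside the lower triangle. */
void check_mapping(int N)
{
  int kmax = N*(N+1)/2;
  for(int k = 0; k < kmax; k++){
    int row = (int)(sqrt(0.25+2.0*k)-0.5);
    int col = k - (row*(row+1))>>1;   /* same expression as in the kernel */
    if( col < 0 || col > row || row >= N ){
      printf("first bad mapping: k=%d -> row=%d, col=%d\n", k, row, col);
      return;
    }
  }
  printf("mapping OK for N=%d\n", N);
}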

Why is option B giving a wrong result even though the thread IDs are correct? What is missing here?

Upvotes: 3

Views: 8726

Answers (3)

talonmies

Reputation: 72349

Your basic problem is that you are launching an improbably huge grid (over 100 million threads for your 10000x10000 array example), and then, because of the triangular nature of the access pattern in the kernel, fully half of those threads never do anything productive. So an enormous number of GPU cycles are being wasted for no particularly good reason. Further, the access pattern you are using doesn't allow coalesced memory access, which further reduces the performance of the threads that are actually doing useful work.

If I understand your problem correctly, the kernel is only performing element-wise division on the lower triangle of a square array. If this is the case, it could equally be done using something like this:

__global__ 
void division(int N, float* A, int* B)
{
    for(int row=blockIdx.x; row<N; row+=gridDim.x) {
        for(int col=threadIdx.x; col<=row; col+=blockDim.x) {
            int val = max(1,B[row*N+col]);   // B <= 0 turns into a harmless divide by 1
            A[row*N+col] /= (float)val;
        }
    }
}

[disclaimer: written in browser, never compiled, never tested, use at own risk]

Here, a one-dimensional grid is used, with each block computing a row at a time. Threads in a block move along the row, so memory access is coalesced. In comments you mention your GPU is a Tesla C2050. That device only requires 112 blocks of 192 threads each to completely "fill" each of the 14 SMs with a full complement of 8 blocks each and the maximum number of concurrent threads per SM. So the launch parameters could be something like:

int N = 10000;
int threads = 192;
int blocks = min(8*14, N);
division<<<blocks, threads>>>(N, A, B);

I would expect this to run considerably faster than your current approach. If numerical accuracy isn't that important, you can probably achieve further speed-up by replacing the division with an approximate reciprocal intrinsic and a floating point multiply.
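For illustration, here is the same kernel with the division replaced by the fast-math intrinsic __fdividef (a sketch along the same lines, untested; it trades a little single-precision accuracy for speed):

__global__ 
void division_fast(int N, float* A, int* B)
{
    for(int row=blockIdx.x; row<N; row+=gridDim.x) {
        for(int col=threadIdx.x; col<=row; col+=blockDim.x) {
            int val = max(1,B[row*N+col]);
            // __fdividef is a faster but less accurate single-precision division
            A[row*N+col] = __fdividef(A[row*N+col], (float)val);
        }
    }
}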

Upvotes: 4

tera

Reputation: 7255

The very first thing to look at should be coalesced memory access. There is no reason for the non-coalesced pattern here; just exchange rows and columns to avoid wasting a lot of memory bandwidth:

int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
...
A[row*N+col] ...

Even if this is run on compute capability 2.0 or higher, the caches are not large enough to remedy this suboptimal pattern.
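Applied to the kernel from the question, the swapped version would look like this (a sketch, untested):

__global__ void division(int N, float* A, int* B){

  // threadIdx.x now runs along a row, so consecutive threads in a warp
  // touch consecutive addresses and the loads/stores coalesce
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;

  if((row < N) && (col <= row) ){
    if( B[row*N+col] >0 )
      A[row*N+col] /= (float)B[row*N+col];
  }

}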

Upvotes: 2

Roger Dahl

Reputation: 15734

Because threads are executed in groups of 32, called warps, you are paying for the division in all 32 threads of a warp if both if conditions are true for just one of the threads. If the condition is false for many threads, see if you can filter out the values for which the division is not needed in a separate kernel, as sketched below.
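One way to do that filtering (a sketch, assuming Thrust is available; needs_div, division_filtered and run_filtered are illustrative names): first collect the flat indices that actually need a division with thrust::copy_if, then launch a kernel over just those indices so every active thread does useful work.

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Predicate: flat index idx is in the lower triangle and has a positive divisor
struct needs_div {
  const int* B;
  int N;
  __host__ __device__ needs_div(const int* b, int n) : B(b), N(n) {}
  __device__ bool operator()(int idx) const {
    int row = idx / N;
    int col = idx - row*N;
    return col <= row && B[idx] > 0;
  }
};

__global__ void division_filtered(int M, const int* idx, float* A, const int* B){
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if(k < M){
    int i = idx[k];
    A[i] /= (float)B[i];   // every active thread performs a useful division
  }
}

// Host side; A and B are device pointers holding N*N elements
void run_filtered(int N, float* A, const int* B)
{
  thrust::device_vector<int> idx((size_t)N*N);
  thrust::counting_iterator<int> first(0), last(N*N);
  int M = thrust::copy_if(first, last, idx.begin(), needs_div(B, N)) - idx.begin();
  int threads = 192;
  int blocks = (M + threads - 1)/threads;
  division_filtered<<<blocks, threads>>>(M, thrust::raw_pointer_cast(idx.data()), A, B);
}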

The int to float conversion may itself be slow. If so, you might be able to generate floats directly in your earlier step, and pass B in as an array of floats.

You may be able to generate inverted numbers in the earlier step, where you generate the B array. If so, you can use multiplication instead of division in this kernel (a / b == a * (1 / b)).
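For example (a sketch; make_reciprocals, division_mul and invB are illustrative names for that earlier step):

// Earlier step: store reciprocals instead of the raw integers
__global__ void make_reciprocals(int n, const int* B, float* invB){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n)
    invB[i] = (B[i] > 0) ? 1.0f/(float)B[i] : 1.0f;   // 1.0f leaves A unchanged
}

// The division kernel then becomes a multiplication
__global__ void division_mul(int N, float* A, const float* invB){
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if((row < N) && (col <= row))
    A[row*N+col] *= invB[row*N+col];
}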

Depending on your algorithm, maybe you can get away with a lower precision division. There's an intrinsic, __fdividef(x, y), that you can try. There is also a compiler flag, -prec-div=false.

Upvotes: 3
