Bernardo

Reputation: 541

CUDA theoretical bandwidth vs effective bandwidth

I have a CUDA kernel that multiplies two matrices whose width and height are multiples of the block size I am using.

The NVIDIA Quadro FX 3800 I am using has a theoretical bandwidth of 50 GB/s, and I am getting some strange results (effective bandwidth larger than theoretical bandwidth).

I will post here some results:

With block size 2:

[10][10] * [10][10] -> BW = 0.02 GB/s
[1000][1000] * [1000][1000] -> BW = 69.4 GB/s

With block size 64:

[1000][1000] * [1000][1000] -> BW = 486.4 GB/s
[10000][10000] * [10000][10000] -> BW = 45072.12 GB/s

I took the effective bandwidth formula from the NVIDIA Best Practices Guide (I have simplified it, but it should be equivalent unless I made a silly mistake). I think the kernel is fine, as it is very similar (if not identical) to some NVIDIA lectures I have read, and as far as I can tell it produces correct results.

#define blocksize 64
#define HM (10000)
#define WM (10000)
#define WN (10000)
#define HN WM
#define WP WN
#define HP HM
#define PTH WM
#define PTW HM

__global__ void nonsquare(float *M, float *N, float *P, int uWM, int uWN)
{
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];

    int tx = threadIdx.x, ty = threadIdx.y, bx = blockIdx.x, by = blockIdx.y;
    int rowM = ty + by * blocksize;
    int colN = tx + bx * blocksize;
    int Pvalue = 0;

    for (int m = 0; m < uWM / blocksize; m++) {
        MS[ty][tx] = M[rowM * uWM + (m * blocksize + tx)];
        NS[ty][tx] = M[colN + uWN * (m * blocksize + ty)];
        __syncthreads();
        for (int k = 0; k < blocksize; k++)
            Pvalue += MS[ty][k] * NS[k][tx];
        P[rowM * WP + colN] = Pvalue;
    }
}
int main(){

    cudaEvent_t evstart, evstop;
    cudaEventCreate(&evstart);
    cudaEventCreate(&evstop);

    float *M = (float *)malloc(sizeof(float) * HM * WM);
    float *N = (float *)malloc(sizeof(float) * HN * WN);

    for (int i = 0; i < WM * HM; i++)
        M[i] = (float)i;
    for (int i = 0; i < WN * HN; i++)
        N[i] = (float)i;

    float *P = (float *)malloc(sizeof(float) * HP * WP);

    float *Md, *Nd, *Pd;
    cudaMalloc((void **)&Md, HM * WM * sizeof(float));
    cudaMalloc((void **)&Nd, HN * WN * sizeof(float));
    cudaMalloc((void **)&Pd, HP * WP * sizeof(float));

    cudaMemcpy(Md, M, HM * WM * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, HN * WN * sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize, blocksize);            // (tile_width, tile_width)
    dim3 dimGrid(WN / dimBlock.x, HM / dimBlock.y); // (width/tile_width, height/tile_width)

    cudaEventRecord(evstart, 0);

    nonsquare<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WM, WN);

    cudaEventRecord(evstop, 0);
    cudaEventSynchronize(evstop);
    float time;
    cudaEventElapsedTime(&time, evstart, evstop);

    cudaMemcpy(P, Pd, WP * HP * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);

    printf("\ntime spent:%f", time);
    // time is in ms, so bytes / (ms * 1e6) gives GB/s
    float Bandwidth = (HM * WM * 4 + WN * HN * 4 + HP * WP * 4) / (time * 1000000);
    printf("\nEffective Bandwidth:%f GB/s\n", Bandwidth);
}

Thanks in advance.

Upvotes: 2

Views: 1237

Answers (3)

fabmilo

Reputation: 48330

I think the kernel is just silently failing.

  1. Did you check for any errors after the kernel invocation?

  2. Does the code work?

  3. What results do you have on the timings?
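One way to check point 1 is a sketch like the following right after the launch (variable names taken from the question's code). Note that a 64×64 block is 4096 threads, far above the 512 threads-per-block limit of compute capability 1.3 hardware such as the Quadro FX 3800, so the launch itself should fail with an invalid-configuration error, and the events would then time an empty stream:

```cuda
nonsquare<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WM, WN);

cudaError_t err = cudaGetLastError();   // catches launch/configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();          // catches errors during execution
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```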

Upvotes: 2

peakxu

Reputation: 6675

Note that by using shared memory, texture memory, etc., it is sometimes possible to exceed theoretical bandwidth. That often means you're tapping into some dedicated hardware-supported function (such as built-in bilinear texture interpolation), perhaps unintentionally.

Besides the reasons Robert Harvey mentioned, there is also the possibility of factory overclocking by board vendors (albeit more common for GeForce cards than Quadros).

Overall, I'd say that you're doing well if you get close to or exceed the theoretical bandwidth (either in memory or compute).

Upvotes: 1

Robert Harvey

Reputation: 180878

I can think of a number of explanations:

  1. Changes to the baseline code that adversely affect the measurements.
  2. Invalid performance assumptions.
  3. Unidentified micro-optimizations.
  4. Unrealistic benchmarks.

You say your code is simplified. I would try using the original benchmark code and see what happens. If the numbers are more realistic, you can compare the original benchmark code with your simplified version to identify the differences.

Upvotes: 0
