qtqt

Reputation: 63

Strange cuBLAS gemm batched performance

I am noticing some strange performance from cublasSgemmStridedBatched and I am looking for an explanation. The matrix size is fixed at 20x20. Here are some timings (multiply only, no data transfer) for a few different batch sizes:

The first few batch sizes behave as I would expect: as the batch size increases by a factor of ten, the time increases roughly linearly. However, at 100,000 matrices there is suddenly a 3.4x speedup. Why?

If I fix the matrix size at 10x10 and run the trial again, I get:

Again there is a surprising 22x speedup at a batch size of 100,000. Why are batch sizes of 1,000 and 10,000 slower than a batch size of 100,000, when the matrix size stays at 10x10?

Are different algorithms used for different batch sizes? I find this performance quite strange. I get similar results with cublasSgemmBatched. The trials were run on a GeForce GTX 1080 Ti. A minimal working example is given below:

#include <stdio.h>
#include <stdlib.h>
#include "math.h"
#include "cublas_v2.h" 
//nvcc -lcublas cublas.c -o cublas.out

int main(int argc, char* argv[])
{
int i,j,k,index;

// Linear dimension of the (square) matrices
int dim = 20;
// Number of matrices in the batch (100,000 here)
int batch_count = 10*10*10*10*10*1;
// Allocate host storage for batch_count A,B,C square matrices
float* h_A = malloc(sizeof(float) * dim * dim * batch_count);
float* h_B = malloc(sizeof(float) * dim * dim * batch_count);
float* h_C = malloc(sizeof(float) * dim * dim * batch_count);
    for(k=0; k<batch_count; k++) {
        for(j=0; j<dim; j++) {
                for(i=0; i<dim; i++) {
                index = i*dim + j + k*dim*dim;
                  h_A[index] = index*index + 0.0f;
                  h_B[index] = index + 1.0f;
                  h_C[index] = 0.0f;
        }
    }
}


float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, sizeof(float) * dim * dim * batch_count);
cudaMalloc(&d_B, sizeof(float) * dim * dim * batch_count);
cudaMalloc(&d_C, sizeof(float) * dim * dim * batch_count);
// Copy the host data to the device (destination first, then source)
cudaMemcpy(d_A, h_A, sizeof(float) * dim * dim * batch_count, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, sizeof(float) * dim * dim * batch_count, cudaMemcpyHostToDevice);
cudaMemcpy(d_C, h_C, sizeof(float) * dim * dim * batch_count, cudaMemcpyHostToDevice);

cublasHandle_t handle;
cublasCreate(&handle);

// Do the actual multiplication 
float time_cuda_event;
cudaEvent_t start, stop;    
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
float alpha = 1.0f;  float beta = 1.0f;
cublasSgemmStridedBatched(handle,
                              CUBLAS_OP_N, 
                              CUBLAS_OP_N,
                              dim, dim, dim,
                              &alpha,
                              (const float*)d_A, dim,
                              dim*dim,
                              (const float*)d_B, dim,
                              dim*dim,
                              &beta,
                              d_C, dim, 
                              dim*dim, 
                              batch_count);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time_cuda_event, start, stop);
printf("Time :  %3.1f ms \n", time_cuda_event);  

cudaMemcpy(h_C,d_C,sizeof(float) * dim * dim * batch_count,cudaMemcpyDeviceToHost);
// Destroy the handle
cublasDestroy(handle);


cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
    return 0;
}

Upvotes: 4

Views: 2986

Answers (1)

talonmies

Reputation: 72372

This appears to just be the result of heuristics within CUBLAS. If I run a modified (and working) version of your code I get these timings for a 5x5 case:

Batch size :           10   Time :  0.019104 ms 
Batch size :          100   Time :  0.038304 ms 
Batch size :         1000   Time :  0.163520 ms 
Batch size :        10000   Time :  1.410944 ms 
Batch size :       100000   Time :  1.614144 ms 
Batch size :      1000000   Time :  16.057407 ms 

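For reference, the kind of loop that produces such timings is just the code from the question with the batch size swept in an outer loop. A minimal sketch (reusing handle, dim, alpha, beta and the device buffers d_A, d_B, d_C from the question, with the buffers sized for the largest batch; the warm-up call is there so one-time cuBLAS start-up costs are not counted):

// Sketch of a batch-size sweep (not the exact benchmark code used above).
int batch_sizes[] = {10, 100, 1000, 10000, 100000, 1000000};
long long stride = (long long)dim * dim;   // one whole matrix per batch item
int b;
float ms;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Warm-up call so library initialisation is excluded from the timings
cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim,
                          &alpha, d_A, dim, stride, d_B, dim, stride,
                          &beta, d_C, dim, stride, batch_sizes[0]);

for (b = 0; b < 6; b++) {
    cudaEventRecord(start, 0);
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim,
                              &alpha, d_A, dim, stride, d_B, dim, stride,
                              &beta, d_C, dim, stride, batch_sizes[b]);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Batch size : %12d   Time :  %f ms \n", batch_sizes[b], ms);
}
cudaEventDestroy(start);
cudaEventDestroy(stop);
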
Profiling shows that for batch sizes up to 10000, the library runs a single kernel:

1.10759s  16.831us             (1 1 10)       (128 1 1)       120  12.250KB        0B         -           -           -           -  GeForce GTX 970         1         7  maxwell_sgemm_128x64_nn [3939]
1.10766s  19.168us            (1 1 100)       (128 1 1)       120  12.250KB        0B         -           -           -           -  GeForce GTX 970         1         7  maxwell_sgemm_128x64_nn [3971]
1.10773s  147.71us           (1 1 1000)       (128 1 1)       120  12.250KB        0B         -           -           -           -  GeForce GTX 970         1         7  maxwell_sgemm_128x64_nn [4003]
1.10791s  1.4064ms          (1 1 10000)       (128 1 1)       120  12.250KB        0B         -           -           -           -  GeForce GTX 970         1         7  maxwell_sgemm_128x64_nn [4035]

while at larger sizes it runs multiple calls to another kernel to service the call:

1.10935s  1.1518ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4063]
1.11050s  606.54us          (1 1 34465)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4087]
1.11113s  1.1498ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4115]
1.11228s  1.1501ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4139]
1.11344s  1.1511ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4163]
1.11459s  1.1494ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4187]
1.11574s  1.1507ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4211]
1.11689s  1.1503ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4235]
1.11804s  1.1499ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4259]
1.11919s  1.1507ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4283]
1.12035s  1.1507ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4307]
1.12150s  1.1509ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4331]
1.12265s  1.1489ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4355]
1.12380s  1.1496ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4379]
1.12495s  1.1500ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4403]
1.12610s  1.1494ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4427]
1.12726s  1.1503ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4451]
1.12841s  299.35us          (1 1 16975)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4475]

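A kernel trace like the one above can be captured with the command line profiler, for example (an illustrative invocation, not necessarily the exact one I used):

nvprof --print-gpu-trace ./cublas.out
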
The inconsistency you have observed seems to be caused by the switch from one kernel to the other within the library, which is probably triggered by some batch-size criterion. You can see that both kernels appear to use one block per batch item; the kernel used at larger sizes runs a 2D block of 256 threads, whereas the kernel used at smaller sizes runs a 1D block of 128 threads. Beyond that, the performance differences come down to internal implementation details. Even though it is probably a violation of the End User Licence to do so, if you want to understand more you will need to disassemble the kernels and look at how they work. The toolkit contains all the tools required to do this, although I am not suggesting you do so.

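If you want to poke at the heuristic without disassembling anything, one crude experiment (a sketch only; the chunk size of 10000 is just a guess at where the switch-over happens, not anything documented) is to split one large batch into several smaller strided-batched calls, offsetting the pointers by whole matrices, and compare both the timings and the kernels the profiler reports:

// Sketch: issue one logical batch as several smaller strided-batched calls.
// The pointer offsets advance by whole matrices; chunk is an arbitrary guess
// at the library's switch-over point, purely for experimentation.
long long stride = (long long)dim * dim;
int chunk = 10000;
int done, n;
for (done = 0; done < batch_count; done += chunk) {
    n = batch_count - done;
    if (n > chunk) n = chunk;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              dim, dim, dim,
                              &alpha,
                              d_A + (long long)done * stride, dim, stride,
                              d_B + (long long)done * stride, dim, stride,
                              &beta,
                              d_C + (long long)done * stride, dim, stride,
                              n);
}
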
Upvotes: 4
