Reputation: 63
I am seeing some strange performance from cublasSgemmStridedBatched and I am looking for an explanation. The matrix size is fixed at 20x20. Here are some timings (multiply only, no data transfer) for a few different batch sizes:
The first few batch sizes behave as I would expect: as the batch size increases by a factor of ten, the time increases roughly linearly. However, with 100,000 matrices there is suddenly a 3.4x speedup. Why?
If the matrix size is fixed at 10x10 and the trial is run again, I find:
Again, there is a surprising 22x speedup at a batch size of 100,000. This makes me wonder why batch sizes of 1,000 and 10,000 are slower than a batch size of 100,000, since the matrix size remains 10x10.
Is a different algorithm used for different batch sizes? I find this performance quite strange. I get similar results when I run the same trial with cublasSgemmBatched. These trials were run on a GeForce GTX 1080 Ti. A minimal working example is below:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
// nvcc -lcublas cublas.c -o cublas.out
int main(int argc, char* argv[])
{
    int i, j, k, index;

    // Linear dimension of matrices
    int dim = 20;
    int batch_count = 10*10*10*10*10*1;   // 100,000

    // Allocate host storage for batch_count A, B and C square matrices
    float* h_A = malloc(sizeof(float) * dim * dim * batch_count);
    float* h_B = malloc(sizeof(float) * dim * dim * batch_count);
    float* h_C = malloc(sizeof(float) * dim * dim * batch_count);

    for (k = 0; k < batch_count; k++) {
        for (j = 0; j < dim; j++) {
            for (i = 0; i < dim; i++) {
                index = i*dim + j + k*dim*dim;
                h_A[index] = index*index + 0.0f;
                h_B[index] = index + 1.0f;
                h_C[index] = 0.0f;
            }
        }
    }

    // Allocate device storage and copy the host matrices to the device
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, sizeof(float) * dim * dim * batch_count);
    cudaMalloc(&d_B, sizeof(float) * dim * dim * batch_count);
    cudaMalloc(&d_C, sizeof(float) * dim * dim * batch_count);
    cudaMemcpy(d_A, h_A, sizeof(float) * dim * dim * batch_count, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, sizeof(float) * dim * dim * batch_count, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, sizeof(float) * dim * dim * batch_count, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Do the actual multiplication, timing only the cuBLAS call
    float time_cuda_event;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    float alpha = 1.0f;
    float beta = 1.0f;
    cublasSgemmStridedBatched(handle,
                              CUBLAS_OP_N,
                              CUBLAS_OP_N,
                              dim, dim, dim,
                              &alpha,
                              (const float*)d_A, dim,
                              dim*dim,
                              (const float*)d_B, dim,
                              dim*dim,
                              &beta,
                              d_C, dim,
                              dim*dim,
                              batch_count);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time_cuda_event, start, stop);
    printf("Time : %3.1f ms \n", time_cuda_event);

    cudaMemcpy(h_C, d_C, sizeof(float) * dim * dim * batch_count, cudaMemcpyDeviceToHost);

    // Destroy the handle and release memory
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
Upvotes: 4
Views: 2986
Reputation: 72372
This appears to just be the result of heuristics within CUBLAS. If I run a modified (and working) version of your code I get these timings for a 5x5 case:
Batch size : 10 Time : 0.019104 ms
Batch size : 100 Time : 0.038304 ms
Batch size : 1000 Time : 0.163520 ms
Batch size : 10000 Time : 1.410944 ms
Batch size : 100000 Time : 1.614144 ms
Batch size : 1000000 Time : 16.057407 ms
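(The exact harness is not important; roughly, it just sizes the buffers for the largest batch once and times the cuBLAS call in a loop over batch sizes. A minimal sketch of that loop, assuming the handle, events and timing variable from your code have already been created; the batch-size list and the 5x5 dimension are my own choices:)
/* Sketch only -- times cublasSgemmStridedBatched for several batch sizes of
   5x5 matrices, reusing device buffers sized for the largest batch.
   Assumes handle, start, stop and time_cuda_event exist as in the question. */
int dim = 5;
int batch_sizes[] = {10, 100, 1000, 10000, 100000, 1000000};
int max_batch = 1000000;
float alpha = 1.0f, beta = 1.0f;
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, sizeof(float) * dim * dim * max_batch);
cudaMalloc(&d_B, sizeof(float) * dim * dim * max_batch);
cudaMalloc(&d_C, sizeof(float) * dim * dim * max_batch);

for (int b = 0; b < 6; b++) {
    int batch_count = batch_sizes[b];
    cudaEventRecord(start, 0);
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              dim, dim, dim, &alpha,
                              d_A, dim, dim*dim,
                              d_B, dim, dim*dim,
                              &beta,
                              d_C, dim, dim*dim,
                              batch_count);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time_cuda_event, start, stop);
    printf("Batch size : %d    Time : %f ms\n", batch_count, time_cuda_event);
}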
Profiling shows that for batches of up to 10,000 entries, the library runs a single kernel per call:
1.10759s 16.831us (1 1 10) (128 1 1) 120 12.250KB 0B - - - - GeForce GTX 970 1 7 maxwell_sgemm_128x64_nn [3939]
1.10766s 19.168us (1 1 100) (128 1 1) 120 12.250KB 0B - - - - GeForce GTX 970 1 7 maxwell_sgemm_128x64_nn [3971]
1.10773s 147.71us (1 1 1000) (128 1 1) 120 12.250KB 0B - - - - GeForce GTX 970 1 7 maxwell_sgemm_128x64_nn [4003]
1.10791s 1.4064ms (1 1 10000) (128 1 1) 120 12.250KB 0B - - - - GeForce GTX 970 1 7 maxwell_sgemm_128x64_nn [4035]
whereas at larger sizes it launches a different kernel multiple times to service a single call:
1.10935s 1.1518ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4063]
1.11050s 606.54us (1 1 34465) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4087]
1.11113s 1.1498ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4115]
1.11228s 1.1501ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4139]
1.11344s 1.1511ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4163]
1.11459s 1.1494ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4187]
1.11574s 1.1507ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4211]
1.11689s 1.1503ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4235]
1.11804s 1.1499ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4259]
1.11919s 1.1507ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4283]
1.12035s 1.1507ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4307]
1.12150s 1.1509ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4331]
1.12265s 1.1489ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4355]
1.12380s 1.1496ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4379]
1.12495s 1.1500ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4403]
1.12610s 1.1494ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4427]
1.12726s 1.1503ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4451]
1.12841s 299.35us (1 1 16975) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4475]
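(For reference, per-kernel traces like the ones above can be collected with the CUDA profiler; assuming the executable name from the compile comment in the question:)
nvprof --print-gpu-trace ./cublas.out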
The inconsistency you have observed appears to be caused by the library switching from one kernel to the other, presumably on some batch-size criterion. You can see that both kernels use one block per batch item: the kernel used at larger sizes runs a 2D block of 256 threads, whereas the smaller-size kernel runs a 1D block of 128 threads. Because the large-batch kernel indexes the batch through the grid's z dimension, each launch covers at most 65535 batch entries, which is why several launches are needed to service one call (65535 + 34465 entries for the 100,000-matrix case, and 15 x 65535 + 16975 entries for the 1,000,000-matrix case in the trace above). Beyond that, the performance differences come down to internal implementation details. Even though it is probably a violation of the End User Licence to do so, if you want to understand more you would need to disassemble the kernels and look at how they work. The toolkit contains all the tools required to do this, although I am not suggesting you do so.
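To make that mapping concrete, here is a toy CUDA kernel of my own, not the library's actual implementation, organised the way the large-batch launch configuration suggests: one block per batch entry indexed through the grid's z dimension, a 16x16 thread block, and one thread per element of each small output matrix.
// Illustrative only -- not the CUBLAS implementation.
// One block per batch entry (grid.z indexes the batch), one thread per C element.
__global__ void toy_batched_gemm(const float *A, const float *B, float *C,
                                 int dim, long long strideA, long long strideB,
                                 long long strideC, float alpha, float beta)
{
    int row   = threadIdx.x;   // 0 .. 15
    int col   = threadIdx.y;   // 0 .. 15
    int batch = blockIdx.z;    // one block per matrix in the batch

    if (row < dim && col < dim) {
        const float *a = A + batch * strideA;
        const float *b = B + batch * strideB;
        float       *c = C + batch * strideC;

        float acc = 0.0f;
        for (int k = 0; k < dim; k++)          // column-major, lda = ldb = ldc = dim
            acc += a[row + k * dim] * b[k + col * dim];

        c[row + col * dim] = alpha * acc + beta * c[row + col * dim];
    }
}

// Launched with something like:
//   dim3 block(16, 16, 1);
//   dim3 grid(1, 1, batch_count);   // grid.z is limited to 65535 blocks per launch
//   toy_batched_gemm<<<grid, block>>>(d_A, d_B, d_C, dim,
//                                     dim*dim, dim*dim, dim*dim, alpha, beta);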
Upvotes: 4