cublas one function call produced three executions

Question

I called the cublas_Sgemm_v2 function for 10236 times with first matrix non-transposed and the second transposed. However, in the nvprof results, I saw three items produced from that function call. The (m, n, k) values to the function call are (588, 588, 20).

There are the items listed in the nvprof results.

Time(%)      Time     Calls       Avg       Min       Max  Name
 12.32%  494.86ms     10236  48.344us  47.649us  49.888us  sgemm_sm35_ldg_nt_128x8x128x16x16
  8.64%  346.91ms     10236  33.890us  32.352us  35.488us  sgemm_sm35_ldg_nt_64x16x128x8x32
  8.11%  325.63ms     10236  31.811us  31.360us  32.512us  sgemm_sm35_ldg_nt_128x16x64x16x16

Is this expected and why is that? Can someone explain what do the values in the function names such as sgemm_sm35_ldg_nt_128x8x128x16x16 mean?

I also have other function calls to cublas_Sgemm_v2 with different transpose settings and I only see one item per each function call.

UPDATE:

As @Marco13 asked, I put more results here:

Time(%)      Time     Calls       Avg       Min       Max  Name
--------------------------------------------------------------------------------

Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (588, 100, 588)
 20.84%  548.30ms      7984  68.675us  58.977us  81.474us  sgemm_sm35_ldg_tn_32x16x64x8x16

Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (588, 100, 588)
 12.95%  340.71ms      7984  42.674us  21.856us  64.514us  sgemm_sm35_ldg_nn_64x16x64x16x16

All the following resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (588, 588, 100)
  9.81%  258.15ms      3992  64.666us  61.601us  68.642us  sgemm_sm35_ldg_nt_128x8x128x16x16
  6.84%  179.90ms      3992  45.064us  40.097us  49.505us  sgemm_sm35_ldg_nt_64x16x128x8x32
  6.33%  166.51ms      3992  41.709us  38.304us  61.185us  sgemm_sm35_ldg_nt_128x16x64x16x16

Another run with 588 changed to 288:

Time(%)      Time     Calls       Avg       Min       Max  Name
--------------------------------------------------------------------------------

Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (288, 100, 288)
 22.01%  269.11ms      7984  33.706us  30.273us  39.232us  sgemm_sm35_ldg_tn_32x16x64x8x16

Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (288, 100, 288)
 14.79%  180.78ms      7984  22.642us  18.752us  26.752us  sgemm_sm35_ldg_nn_64x16x64x16x16

Resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (288, 288, 100)
  7.43%  90.886ms      3992  22.766us  19.936us  25.024us  sgemm_sm35_ldg_nt_64x16x64x16x16

From the last three lines is looks like certain transposition types can be more efficient than the others, and certain matrix sizes are more economic in terms of computation time over matrix size. What is the guideline of ensuring economic computation?

UPDATE 2:

For the case of (m, n, k) = (588, 100, 588) above, I manually transposed the matrix before calling the sgemm function, then there is only one item in the nvprof result. The time it take is only a little less than the sum of the two items in the above table. So there is no much performance gain from doing so.

Time(%)      Time     Calls       Avg       Min       Max  Name
--------------------------------------------------------------------------------
 31.65%  810.59ms     15968  50.763us  21.505us  72.098us  sgemm_sm35_ldg_nn_64x16x64x16x16

Marco13 · Accepted Answer

Sorry, not an answer - but slightly too long for a comment:

Concerning the edit, about the influence of the "transpose" state: Transposing a matrix might cause an access pattern that is worse in terms of memory coalescing. A quick websearch brings brings some results about this ( https://devtalk.nvidia.com/default/topic/528450/cuda-programming-and-performance/cublas-related-question/post/3734986/#3734986 ), but with a slightly different setup than yours:

DGEMM performance on a K20c

args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.13280010 sec GFLOPS=1034.93

args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.13872910 sec GFLOPS=990.7

args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.12521601 sec GFLOPS=1097.61

args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.13652611 sec GFLOPS=1006.69

In this case, the differences do not seem worth the hassle of changing the matrix storage (e.g. from column-major to row-major, to avoid transposing the matrix), because all patterns seem to run with a similar speed. But your mileage may vary - particularly, the difference in your tests between (t,n) and (n,n) are very large (548ms vs 340ms), which I found quite surprising. If you have the choice to easily switch between various representations of the matrix, then a benchmark covering all the four cases may be worthwhile.

In any case, regarding your question about the functions that are called there: The CUBLAS code for the sgemm function in CUBLAS 1.1 was already full of unrolled loops and already contained 80 (!) versions of the sgemm function for different cases that have been assembled using a #define-hell. It has to be assumed that this has become even more unreadable in the newer CUBLAS versions, where the newer compute capabilities have to be taken into account - and the function names that you found there indicated that this indeed is the case:

sgemm_sm35_ldg_nt_64x16x128x8x32

sm35 : Runs on a device with compute capability 3.5
ldg : ? Non-texture-memory version ? (CUBLAS 1.1 contained functions called sgemm_main_tex_* which worked on texture memory, and functions sgemm_main_gld_* which worked on normal, global memory)
nt : First matrix is Not transposed, second one is Transposed
64x16x128x8x32 - Probably related to tile sizes, maybe shared memory etc...

Still, I think it's surprising that a single call to sgemm causes three of these internal functions to be called. But as mentioned in the comment: I assume that they try to handle the "main" part of the matrix with a specialized, efficient version, and "border tiles" with one that is capable of doing range checks and/or cope with warps that are not full. (Not very precise, just to be suggestive: A matrix of size 288x288 could be handled by an efficient core for matrices of size 256x256, and two calls for the remaining 32x288 and 288x32 entries).

But all this is also the reason why I guess there can hardly be a general guideline concerning the matrix sizes: The "best" matrix size in terms of computation time over matrix size will at least depend on

the hardware version (compute capability) of the target system
the transposing-flags
the CUBLAS version

EDIT Concerning the comment: One could imagine that there should be a considerable difference between the transosed and the non-transposed processing. When multiplying two matrices

a00 a01 a02     b00 b01 b02
a10 a11 a12  *  b10 b11 b12
a20 a21 a22     b20 b21 b22

Then the first element of the result will be

a00 * b00 + a01 * b10 + a02 * b20

(which simply is the dot product of the first row of a and the first column of b). For this computation one has to read consecutive values from a. But the values that are read from b are not consecutive. Instead, they are "the first value in each row". One could think that this would have a negative impact on memory coalescing. But for sure, the NVIDIA engineers have tried hard to avoid any negative impact here, and the implementation of sgemm in CUBLAS is far, far away from "a parallel version of the naive 3-nested-loops implementation" where this access pattern would have such an obvious drawback.

cublas one function call produced three executions

Answers (1)

Related Questions