Reputation: 1964
I called the cublas_Sgemm_v2
function for 10236 times with first matrix non-transposed and the second transposed. However, in the nvprof
results, I saw three items produced from that function call. The (m, n, k) values to the function call are (588, 588, 20).
There are the items listed in the nvprof
results.
Time(%) Time Calls Avg Min Max Name
12.32% 494.86ms 10236 48.344us 47.649us 49.888us sgemm_sm35_ldg_nt_128x8x128x16x16
8.64% 346.91ms 10236 33.890us 32.352us 35.488us sgemm_sm35_ldg_nt_64x16x128x8x32
8.11% 325.63ms 10236 31.811us 31.360us 32.512us sgemm_sm35_ldg_nt_128x16x64x16x16
Is this expected and why is that? Can someone explain what do the values in the function names such as sgemm_sm35_ldg_nt_128x8x128x16x16
mean?
I also have other function calls to cublas_Sgemm_v2 with different transpose settings and I only see one item per each function call.
UPDATE:
As @Marco13 asked, I put more results here:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (588, 100, 588)
20.84% 548.30ms 7984 68.675us 58.977us 81.474us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (588, 100, 588)
12.95% 340.71ms 7984 42.674us 21.856us 64.514us sgemm_sm35_ldg_nn_64x16x64x16x16
All the following resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (588, 588, 100)
9.81% 258.15ms 3992 64.666us 61.601us 68.642us sgemm_sm35_ldg_nt_128x8x128x16x16
6.84% 179.90ms 3992 45.064us 40.097us 49.505us sgemm_sm35_ldg_nt_64x16x128x8x32
6.33% 166.51ms 3992 41.709us 38.304us 61.185us sgemm_sm35_ldg_nt_128x16x64x16x16
Another run with 588 changed to 288:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (288, 100, 288)
22.01% 269.11ms 7984 33.706us 30.273us 39.232us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (288, 100, 288)
14.79% 180.78ms 7984 22.642us 18.752us 26.752us sgemm_sm35_ldg_nn_64x16x64x16x16
Resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (288, 288, 100)
7.43% 90.886ms 3992 22.766us 19.936us 25.024us sgemm_sm35_ldg_nt_64x16x64x16x16
From the last three lines is looks like certain transposition types can be more efficient than the others, and certain matrix sizes are more economic in terms of computation time over matrix size. What is the guideline of ensuring economic computation?
UPDATE 2:
For the case of (m, n, k) = (588, 100, 588) above, I manually transposed the matrix before calling the sgemm function, then there is only one item in the nvprof
result. The time it take is only a little less than the sum of the two items in the above table. So there is no much performance gain from doing so.
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
31.65% 810.59ms 15968 50.763us 21.505us 72.098us sgemm_sm35_ldg_nn_64x16x64x16x16
Upvotes: 0
Views: 396
Reputation: 54639
Sorry, not an answer - but slightly too long for a comment:
Concerning the edit, about the influence of the "transpose" state: Transposing a matrix might cause an access pattern that is worse in terms of memory coalescing. A quick websearch brings brings some results about this ( https://devtalk.nvidia.com/default/topic/528450/cuda-programming-and-performance/cublas-related-question/post/3734986/#3734986 ), but with a slightly different setup than yours:
DGEMM performance on a K20c
args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.13280010 sec GFLOPS=1034.93
args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.13872910 sec GFLOPS=990.7
args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.12521601 sec GFLOPS=1097.61
args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.13652611 sec GFLOPS=1006.69
In this case, the differences do not seem worth the hassle of changing the matrix storage (e.g. from column-major to row-major, to avoid transposing the matrix), because all patterns seem to run with a similar speed. But your mileage may vary - particularly, the difference in your tests between (t,n) and (n,n) are very large (548ms vs 340ms), which I found quite surprising. If you have the choice to easily switch between various representations of the matrix, then a benchmark covering all the four cases may be worthwhile.
In any case, regarding your question about the functions that are called there: The CUBLAS code for the sgemm
function in CUBLAS 1.1 was already full of unrolled loops and already contained 80 (!) versions of the sgemm
function for different cases that have been assembled using a #define
-hell. It has to be assumed that this has become even more unreadable in the newer CUBLAS versions, where the newer compute capabilities have to be taken into account - and the function names that you found there indicated that this indeed is the case:
sgemm_sm35_ldg_nt_64x16x128x8x32
sm35
: Runs on a device with compute capability 3.5ldg
: ? Non-texture-memory version ? (CUBLAS 1.1 contained functions called sgemm_main_tex_*
which worked on texture memory, and functions sgemm_main_gld_*
which worked on normal, global memory)nt
: First matrix is Not transposed, second one is Transposed64x16x128x8x32
- Probably related to tile sizes, maybe shared memory etc...Still, I think it's surprising that a single call to sgemm
causes three of these internal functions to be called. But as mentioned in the comment: I assume that they try to handle the "main" part of the matrix with a specialized, efficient version, and "border tiles" with one that is capable of doing range checks and/or cope with warps that are not full. (Not very precise, just to be suggestive: A matrix of size 288x288 could be handled by an efficient core for matrices of size 256x256, and two calls for the remaining 32x288 and 288x32 entries).
But all this is also the reason why I guess there can hardly be a general guideline concerning the matrix sizes: The "best" matrix size in terms of computation time over matrix size will at least depend on
EDIT Concerning the comment: One could imagine that there should be a considerable difference between the transosed and the non-transposed processing. When multiplying two matrices
a00 a01 a02 b00 b01 b02
a10 a11 a12 * b10 b11 b12
a20 a21 a22 b20 b21 b22
Then the first element of the result will be
a00 * b00 + a01 * b10 + a02 * b20
(which simply is the dot product of the first row of a
and the first column of b
). For this computation one has to read consecutive values from a
. But the values that are read from b
are not consecutive. Instead, they are "the first value in each row". One could think that this would have a negative impact on memory coalescing. But for sure, the NVIDIA engineers have tried hard to avoid any negative impact here, and the implementation of sgemm
in CUBLAS is far, far away from "a parallel version of the naive 3-nested-loops implementation" where this access pattern would have such an obvious drawback.
Upvotes: 2