Reputation: 589
Background
If you have been following my posts, I am attempting to replicate the results found in Kazushige Goto's seminal paper on square matrix multiplication C = AB
. My last post regarding this topic can be found here. In that version of my code, I follow the memory layering and packing strategy of Goto with an inner kernel computing 2x8
blocks of C
using 128 bit SSE3 intrinsics. My CPU is i5-540M with hyperthreading off. Additional info about my hardware can be found in another post and is repeated below.
My Hardware
My CPU is an Intel i5 - 540M. You can find the relevant CPUID information on cpu-world.com. The microarchitecture is Nehalem (westmere), so it can theoretically compute 4 double precision flops per core per cycle. I will be using just one core (no OpenMP), so with hyperthreading off and 4-step Intel Turbo Boost, I should be seeing a peak of ( 2.533 Ghz + 4*0.133 Ghz ) * ( 4 DP flops/core/cycle ) * ( 1 core ) = 12.27 DP Gflops
. For reference, with both cores running at peak, Intel Turbo Boost gives a 2-step speed up and I should get a theoretical peak of 22.4 DP Gflops
.
My Software
Windows7 64 bit, but MinGW/GCC 32 bit due to restrictions on my computer.
What's new this time?
2x4
blocks of C
. This gives better performance and is in line with what Goto says (half the registers shoubld be used to compute C
). I have tried many sizes: 1x8
, 2x8
, 2x4
, 4x2
, 2x2
, 4x4
.clock()
.Some Preliminary Results
N
is the dimension of the square matrices, Total Gflops/s
is the Gflops/s of the entire code, and Kernel Gflops/s
is the Gflops/s of the inner kernel. You can see that with a 12.26 Gflops/s peak on one core, the inner kernel is getting around 75% efficiency, while the overall code is about 60% efficient.
I would like to get closer to 95% efficiency for the kernel and 80% for the overall code. What else can I do to improve the performance, of at least the inner kernel?
N Total Gflops/s Kernel Gflops/s
256 7.778089 9.756284
512 7.308523 9.462700
768 7.223283 9.253639
1024 7.197375 9.132235
1280 7.142538 8.974122
1536 7.114665 8.967249
1792 7.060789 8.861958
Source Code
If you are feeling particularly magnanimous, please test my code on your machine. Compiled with gcc -std=c99 -O2 -m32 -msse3 -mincoming-stack-boundary=2 -masm=intel somatmul2.c -o somatmul2.exe
. Feel free to try other flags, but I have found these to work best on my machine.
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <xmmintrin.h>
#include <math.h>
#include <omp.h>
#include <stdint.h>
#include <windows.h>
#ifndef min
#define min( a, b ) ( ((a) < (b)) ? (a) : (b) )
#endif
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(64))) rpack( double* restrict dst,
const double* restrict src,
const int kc, const int mc, const int mr, const int n)
{
double tmp[mc*kc] __attribute__ ((aligned(64)));
double* restrict ptr = &tmp[0];
for (int i = 0; i < mc; ++i)
for (int j = 0; j < kc; j+=4) {
*ptr++ = *(src + i*n + j );
*ptr++ = *(src + i*n + j+1);
*ptr++ = *(src + i*n + j+2);
*ptr++ = *(src + i*n + j+3);
}
ptr = &tmp[0];
//const int inc_dst = mr*kc;
for (int k = 0; k < mc; k+=mr)
for (int j = 0; j < kc; ++j)
for (int i = 0; i < mr*kc; i+=kc)
*dst++ = *(ptr + k*kc + j + i);
}
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(64))) cpack(double* restrict dst,
const double* restrict src,
const int nc,
const int kc,
const int nr,
const int n)
{ //nc cols, kc rows, nr size ofregister strips
double tmp[kc*nc] __attribute__ ((aligned(64)));
double* restrict ptr = &tmp[0];
for (int i = 0; i < kc; ++i)
for (int j = 0; j < nc; j+=4) {
*ptr++ = *(src + i*n + j );
*ptr++ = *(src + i*n + j+1);
*ptr++ = *(src + i*n + j+2);
*ptr++ = *(src + i*n + j+3);
}
ptr = &tmp[0];
// const int inc_k = nc/nr;
for (int k = 0; k < nc; k+=nr)
for (int j = 0; j < kc*nc; j+=nc)
for (int i = 0; i < nr; ++i)
*dst++ = *(ptr + k + i + j);
}
#define KERNEL0(add0,add1,add2,add3) \
"mulpd xmm4, xmm6 \n\t" \
"addpd xmm0, xmm4 \n\t" \
"movapd xmm4, XMMWORD PTR [edx+"#add2"] \n\t" \
"mulpd xmm7, xmm4 \n\t" \
"addpd xmm1, xmm7 \n\t" \
"movddup xmm5, QWORD PTR [eax+"#add0"] \n\t" \
"mulpd xmm6, xmm5 \n\t" \
"addpd xmm2, xmm6 \n\t" \
"movddup xmm7, QWORD PTR [eax+"#add1"] \n\t" \
"mulpd xmm4, xmm5 \n\t" \
"movapd xmm6, XMMWORD PTR [edx+"#add3"] \n\t" \
"addpd xmm3, xmm4 \n\t" \
"movapd xmm4, xmm7 \n\t" \
" \n\t"
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(64))) dgemm_2x4_asm_j
(
const int mc, const int nc, const int kc,
const double* restrict locA, const int cs_a, // mr
const double* restrict locB, const int rs_b, // nr
double* restrict C, const int rs_c
)
{
const double* restrict a1 = locA;
for (int i = 0; i < mc ; i+=cs_a) {
const double* restrict b1 = locB;
double* restrict c11 = C + i*rs_c;
for (int j = 0; j < nc ; j+=rs_b) {
__asm__ __volatile__
(
"mov eax, %[a1] \n\t"
"mov edx, %[b1] \n\t"
"mov edi, %[c11] \n\t"
"mov ecx, %[kc] \n\t"
"pxor xmm0, xmm0 \n\t"
"movddup xmm7, QWORD PTR [eax] \n\t" // a1
"pxor xmm1, xmm1 \n\t"
"movapd xmm6, XMMWORD PTR [edx] \n\t" // b1
"pxor xmm2, xmm2 \n\t"
"movapd xmm4, xmm7 \n\t" // a1
"pxor xmm3, xmm3 \n\t"
"sar ecx, 3 \n\t" // divide by 2^num
"L%=: \n\t" // start loop
KERNEL0( 8, 16, 16, 32)
KERNEL0( 24, 32, 48, 64)
KERNEL0( 40, 48, 80, 96)
KERNEL0( 56, 64, 112, 128)
KERNEL0( 72, 80, 144, 160)
KERNEL0( 88, 96, 176, 192)
KERNEL0( 104, 112, 208, 224)
KERNEL0( 120, 128, 240, 256)
"add eax, 128 \n\t"
"add edx, 256 \n\t"
" \n\t"
"dec ecx \n\t"
"jne L%= \n\t" // end loop
" \n\t"
"mov esi, %[rs_c] \n\t" // don't need cs_a anymore
"sal esi, 3 \n\t" // times 8
"lea ebx, [edi+esi] \n\t" // don't need b1 anymore
"addpd xmm0, XMMWORD PTR [edi] \n\t" // c11
"addpd xmm1, XMMWORD PTR [edi+16] \n\t" // c11 + 2
"addpd xmm2, XMMWORD PTR [ebx] \n\t" // c11
"addpd xmm3, XMMWORD PTR [ebx+16] \n\t" // c11 + 2
"movapd XMMWORD PTR [edi], xmm0 \n\t"
"movapd XMMWORD PTR [edi+16], xmm1 \n\t"
"movapd XMMWORD PTR [ebx], xmm2 \n\t"
"movapd XMMWORD PTR [ebx+16], xmm3 \n\t"
: // no outputs
: // inputs
[kc] "m" (kc),
[a1] "m" (a1),
[cs_a] "m" (cs_a),
[b1] "m" (b1),
[rs_b] "m" (rs_b),
[c11] "m" (c11),
[rs_c] "m" (rs_c)
: //register clobber
"memory",
"eax","ebx","ecx","edx","esi","edi",
"xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7"
);
b1 += rs_b*kc;
c11 += rs_b;
}
a1 += cs_a*kc;
}
}
double blis_dgemm_ref(
const int n,
const double* restrict A,
const double* restrict B,
double* restrict C,
const int mc,
const int nc,
const int kc
)
{
int mr = 2;
int nr = 4;
double locA[mc*kc] __attribute__ ((aligned(64)));
double locB[kc*nc] __attribute__ ((aligned(64)));
LARGE_INTEGER frequency, time1, time2;
double time3 = 0.0;
QueryPerformanceFrequency(&frequency);
// zero C
memset(C, 0, n*n*sizeof(double));
int ii,jj,kk;
//#pragma omp parallel num_threads(2) shared(A,B,C) private(ii,jj,kk,locA,locB)
{//use all threads in parallel
//#pragma omp for
for ( jj = 0; jj < n; jj+=nc) {
for ( kk = 0; kk < n; kk+=kc) {
cpack(locB, B + kk*n + jj, nc, kc, nr, n);
for ( ii = 0; ii < n; ii+=mc) {
rpack(locA, A + ii*n + kk, kc, mc, mr, n);
QueryPerformanceCounter(&time1);
dgemm_2x4_asm_j( mc, nc, kc,
locA , mr,
locB , nr,
C + ii*n + jj, n );
QueryPerformanceCounter(&time2);
time3 += (double) (time2.QuadPart - time1.QuadPart);
}
}
}
}
return time3 / frequency.QuadPart;
}
double compute_gflops(const double time, const int n)
{
// computes the gigaflops for a square matrix-matrix multiplication
double gflops;
gflops = (double) (2.0*n*n*n)/time/1.0e9;
return(gflops);
}
void main() {
LARGE_INTEGER frequency, time1, time2;
double time3, best_time;
double kernel_time, best_kernel_time;
QueryPerformanceFrequency(&frequency);
int best_flag;
double gflops, kernel_gflops;
const int trials = 100;
int nmax = 4096;
printf("%16s %16s %16s\n","N","Total Gflops/s","Kernel Gflops/s");
int mc = 256;
int kc = 256;
int nc = 128;
for (int n = kc; n <= nmax; n+=kc) {
double *A = NULL;
double *B = NULL;
double *C = NULL;
A = _mm_malloc (n*n * sizeof(*A),64); if (!A) {printf("A failed\n"); return;}
B = _mm_malloc (n*n * sizeof(*B),64); if (!B) {printf("B failed\n"); return;}
C = _mm_malloc (n*n * sizeof(*C),64); if (!C) {printf("C failed\n"); return;}
srand(time(NULL));
// Create the matrices
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
*(A + i*n + j) = (double) rand()/RAND_MAX;
*(B + i*n + j) = (double) rand()/RAND_MAX;
}
}
// warmup
blis_dgemm_ref(n,A,B,C,mc,nc,kc);
for (int count = 0; count < trials; count++){
QueryPerformanceCounter(&time1);
kernel_time = blis_dgemm_ref(n,A,B,C,mc,nc,kc);
QueryPerformanceCounter(&time2);
time3 = (double)(time2.QuadPart - time1.QuadPart) / frequency.QuadPart;
if (count == 0) {
best_time = time3;
best_kernel_time = kernel_time;
}
else {
best_flag = ( time3 < best_time ? 1 : 0 );
if (best_flag) {
best_time = time3;
best_kernel_time = kernel_time;
}
}
}
gflops = compute_gflops(best_time, n);
kernel_gflops = compute_gflops(best_kernel_time, n);
printf("%16d %16f %16f\n",n,gflops,kernel_gflops);
_mm_free(A);
_mm_free(B);
_mm_free(C);
}
printf("tests are done\n");
}
EDIT
Replace the packing functions with the following new and faster versions:
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(64))) rpack( double* restrict dst,
const double* restrict src,
const int kc, const int mc, const int mr, const int n)
{
for (int i = 0; i < mc/mr; ++i)
for (int j = 0; j < kc; ++j)
for (int k = 0; k < mr; ++k)
*dst++ = *(src + i*n*mr + k*n + j);
}
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(64))) cpack(double* restrict dst,
const double* restrict src,
const int nc,
const int kc,
const int nr,
const int n)
{
for (int i = 0; i < nc/nr; ++i)
for (int j = 0; j < kc; ++j)
for (int k = 0; k < nr; ++k)
*dst++ = *(src + i*nr + j*n + k);
}
Results With New Packing Functions
Nice boost to the overall performance:
N Total Gflops/s Kernel Gflops/s
256 7.915617 8.794849
512 8.466467 9.350920
768 8.354890 9.135575
1024 8.168944 8.884611
1280 8.174249 8.825920
1536 8.285458 8.938712
1792 7.988038 8.581001
LINPACK 32-bit with 1 thread
CPU frequency: 2.792 GHz
Number of CPUs: 1
Number of cores: 2
Number of threads: 1
Performance Summary (GFlops)
Size LDA Align. Average Maximal
128 128 4 4.7488 5.0094
256 256 4 6.0747 6.9652
384 384 4 6.5208 7.2767
512 512 4 6.8329 7.5706
640 640 4 7.4278 7.8835
768 768 4 7.7622 8.0677
896 896 4 7.8860 8.4737
1024 1024 4 7.7292 8.1076
1152 1152 4 8.0411 8.4738
1280 1280 4 8.1429 8.4863
1408 1408 4 8.2284 8.7073
1536 1536 4 8.3753 8.6437
1664 1664 4 8.6993 8.9108
1792 1792 4 8.7576 8.9176
1920 1920 4 8.7945 9.0678
2048 2048 4 8.5490 8.8827
2176 2176 4 9.0138 9.1161
2304 2304 4 8.1402 9.1446
2432 2432 4 9.0003 9.2082
2560 2560 4 8.8560 9.1197
2688 2688 4 9.1008 9.3144
2816 2816 4 9.0876 9.3089
2944 2944 4 9.0771 9.4191
3072 3072 4 8.9402 9.2920
3200 3200 4 9.2259 9.3699
3328 3328 4 9.1224 9.3821
3456 3456 4 9.1354 9.4082
3584 3584 4 9.0489 9.3351
3712 3712 4 9.3093 9.5108
3840 3840 4 9.3307 9.5324
3968 3968 4 9.3895 9.5352
4096 4096 4 9.3269 9.3872
EDIT 2
Here are some results from single-threaded OpenBLAS, which took 4 hours to compile last night. As you can see, it is getting close to 95% CPU usage. Max single-threaded performance with both cores on is 11.2 Gflops
(2 step Intel Turbo Boost). I need to turn off the other core to get 12.26 Gflops
(4 step Intel Turbo Boost). Assume that the packing functions in OpeBLAS generate no additional overhead. Then the OpenBLAS kernel must be running at least as fast as the total OpenBLAS code. So I need to get my kernel running at that speed. I have yet to figure out how to make my assembly faster. I will be focusing on this over the next few days.
Ran the tests below from Windows command line with: start /realtime /affinity 1
My Code:
N Total Gflops/s Kernel Gflops/s
256 7.927740 8.832366
512 8.427591 9.347094
768 8.547722 9.352993
1024 8.597336 9.351794
1280 8.588663 9.296724
1536 8.589808 9.271710
1792 8.634201 9.306406
2048 8.527889 9.235653
OpenBLAS:
N Total Gflops/s
256 10.599065
512 10.622686
768 10.745133
1024 10.762757
1280 10.832540
1536 10.793132
1792 10.848356
2048 10.819986
Upvotes: 22
Views: 1819
Reputation: 1755
It's theoretically possible to look at that code and reason through whether it could be arranged to make better use of microarchitectural resources - but even the performance architects at Intel might not recommend doing it that way. It might help to use a tool like VTune or Intel Performance Counter Monitor to find out how much of your workload is memory versus front-end versus back-end bound. Intel Architecture Code Analyzer might also be a quick source of help narrowing down which of the potential issues listed below to follow up on first.
Nominal Animal is probably on the right track in the comments talking about interleaving instructions that access memory and those that do computation. A few other possibilities:
mulpd
is always going to dispatch to port 1 on Westmere. Maybe if there are any cycles where port 0 isn't getting used, you could sneak in a scalar FP multiply there.dgemm_2x4_asm_j
is faking out the prefetchers.Upvotes: 1