Reputation: 91
I am creating a simple matrix multiplication procedure for the Intel Xeon Phi architecture. The procedure looks like this (the parameters are A, B and C), and the timing does not include initialization:
//start timing
for(int i = 0; i < size; i++){
    for(int k = 0; k < size; k++) {
        register TYPE aik = A[i][k];
        for(int j = 0; j < size; j++) {
            C[i][j] += aik * B[k][j];
        }
    }
}
//end timing
I am using restrict, aligned data and so on. However, if the matrices are allocated with dynamic memory (posix_memalign), the computation incurs a severe slowdown: for TYPE=float and 512x512 matrices it takes ~0.55 s in the dynamic case versus ~0.25 s otherwise. On a different architecture (Intel Xeon E5) there is also a slowdown, but it is barely noticeable (about 0.002 s).
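For reference, this is roughly how the two variants I am comparing are set up (a minimal sketch; SIZE, ALIGN and the helper names are illustrative, not my exact code):

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

#define SIZE  512
#define ALIGN 64                 /* cache-line / vector alignment */
typedef float TYPE;

/* "Static" case: global arrays live in the BSS and are zeroed by the loader. */
TYPE A_s[SIZE][SIZE], B_s[SIZE][SIZE], C_s[SIZE][SIZE];

/* "Dynamic" case: one 64-byte-aligned block per matrix via posix_memalign. */
static TYPE (*alloc_matrix(void))[SIZE]
{
    void *p = NULL;
    if (posix_memalign(&p, ALIGN, sizeof(TYPE) * SIZE * SIZE) != 0)
        return NULL;
    return (TYPE (*)[SIZE])p;
}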
Any help is appreciated!
Upvotes: 1
Views: 320
Reputation: 1
In the "non-dynamic" case, are the arrays just global variables? If so, they end up in BSS and when the ELF is loaded, the OS will initialize them to zero by default - that's how BSS works. If you allocate them dynamically, independent of what method you use (i.e. malloc, new, posix_memalign, exception is mmap(MAP_POPULATE)), you'll cause faults in the OS when you touch the memory. Fault handling is always expensive. It is relatively more expensive on the Coprocessor because you're running on a tiny little core from a single threaded performance standpoint.
Upvotes: 0
Reputation: 629
What happens to the timing difference if you make the matrices a different size (e.g. 513x513)?
The reason I ask is that I think you might be seeing this effect because you exceed the cache's way associativity and evict elements of C[i][] from L2 as you loop over B in the k loop. If B and C are aligned and the sizes are powers of two, you can get cache super-alignment, which causes this issue.
If B and C are on the stack or otherwise unaligned, you do not see this effect, because fewer addresses are power-of-two aligned. One easy way to test this is to pad the row stride, as sketched below.
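A small sketch of padding the row stride so rows of B and C no longer map to the same L2 sets; the pad of 16 floats (one 64-byte cache line) and the helper name are illustrative choices, not taken from the question:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

#define SIZE   512
#define PAD    16                     /* extra floats per row */
#define STRIDE (SIZE + PAD)           /* 528: no longer a power of two */
typedef float TYPE;

static TYPE (*alloc_padded(void))[STRIDE]
{
    void *p = NULL;
    if (posix_memalign(&p, 64, sizeof(TYPE) * SIZE * STRIDE) != 0)
        return NULL;
    return (TYPE (*)[STRIDE])p;
}

/* Usage: index as B[k][j] with j < SIZE; the last PAD elements of each
 * row are never touched, they only shift the addresses of later rows. */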
Upvotes: 1