alessandrolenzi

Reputation: 91

Dynamic memory slow down on Intel Xeon Phi

I am writing a simple matrix multiplication procedure that runs on the Intel Xeon Phi architecture. The procedure looks like this (the parameters are A, B, C), and the timing doesn't include initialization:

//start timing
for(int i = 0; i < size; i++){
    for(int k = 0; k < size; k++) {
        register TYPE aik = A[i][k];
        for(int j = 0; j < size; j++) {
            C[i][j] += aik * B[k][j];
        }
    }
}
//end timing

I am using restrict, aligned data and so on. However, if the matrices are allocated dynamically (with posix_memalign), the computation suffers a severe slowdown: for TYPE=float and 512x512 matrices it takes ~0.55 s in the dynamic case versus ~0.25 s otherwise. On a different architecture (Intel Xeon E5) there is also a slowdown, but it is barely noticeable (about 0.002 s).

Any help is appreciated!

Upvotes: 1

Views: 320

Answers (2)

Ravi Murty

Reputation: 1

In the "non-dynamic" case, are the arrays just global variables? If so, they end up in BSS, and when the ELF is loaded the OS initializes them to zero by default; that is how BSS works. If you allocate them dynamically, regardless of the method (malloc, new, posix_memalign; the exception is mmap with MAP_POPULATE), you will trigger page faults in the OS when you first touch the memory. Fault handling is always expensive, and it is relatively more expensive on the coprocessor because, from a single-threaded performance standpoint, each of its cores is tiny.
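One way to check whether first-touch page faults explain the gap is to touch every page before the timer starts. Here is a minimal sketch, assuming the matrix is stored as one flat block; `alloc_prefaulted` is a hypothetical helper name, not something from the question:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate an n x n float matrix as one aligned block
 * and write to every byte, so the page faults are taken here rather than
 * inside the timed multiplication loop. Returns NULL on failure. */
float *alloc_prefaulted(size_t n, size_t alignment)
{
    void *p = NULL;
    size_t bytes = n * n * sizeof(float);

    assert(n > 0);
    /* alignment must be a power of two and a multiple of sizeof(void *) */
    if (posix_memalign(&p, alignment, bytes) != 0)
        return NULL;

    memset(p, 0, bytes);  /* first touch: maps the pages before timing */
    return p;
}
```

If the dynamic case gets close to the static one after calling something like `alloc_prefaulted(512, 64)` for A, B and C before the timed region, the fault cost was most likely the culprit. Release the block with `free` as usual.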

Upvotes: 0

amckinley

Reputation: 629

What happens to the timing differences if you make the matrix a different size? (e.g. 513x513)

The reason I ask is that you might be seeing this effect because you exceed the cache's associativity and evict elements of C[i][] from L2 as you loop over B in the loop over k. If B and C are aligned and the sizes are powers of 2, you might get cache super-alignment, which would cause this issue.

If B and C are on the stack or otherwise not aligned, you don't see this effect as fewer addresses are power of 2 aligned.
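If you want to test this without changing the logical matrix size, you could pad the row stride instead. The following is only a sketch under some assumptions: flat row-major storage, a 64-byte cache line, and the hypothetical names `padded_row_len` and `matmul_padded`. Whether it actually helps depends on the cache geometry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round the row length up to a whole number of cache
 * lines, then keep adding one line while the result is a power of two,
 * so consecutive rows do not start at power-of-two strides. */
size_t padded_row_len(size_t n, size_t elem_size, size_t line_bytes)
{
    assert(n > 0 && elem_size > 0 && line_bytes % elem_size == 0);
    size_t per_line = line_bytes / elem_size;
    size_t ld = (n + per_line - 1) / per_line * per_line;

    while ((ld & (ld - 1)) == 0)  /* true exactly when ld is a power of two */
        ld += per_line;
    return ld;
}

/* Same i-k-j loop order as in the question, on flat arrays whose rows are
 * ld elements apart (ld >= n). C must be zeroed by the caller. */
void matmul_padded(size_t n, size_t ld,
                   const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            float aik = A[i * ld + k];
            for (size_t j = 0; j < n; j++)
                C[i * ld + j] += aik * B[k * ld + j];
        }
    }
}
```

For 512x512 floats and 64-byte lines this gives a stride of 528 elements, so each matrix needs `512 * 528 * sizeof(float)` bytes. If the timing gap shrinks with the padded stride, that points toward set conflicts rather than allocation cost.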

Upvotes: 1
