Reputation:
I want to shift the first row of a 2D square matrix to the last row. So if I have a matrix like A, I want to get B.
I can do this using two simple for loops. E.g.
void shift(int M, int N, int A[M][N]){
int i, j,temp;
for (i = 1; i < M; i++){
for (j = 0; j < N; j++){
temp=A[i][j];
A[i][j]=A[i-1][j];
A[i-1][j]=temp;
}
}
}
But I want to get as few cache misses as possible. Any tip on how to do that?
Upvotes: 3
Views: 422
Reputation: 363902
I just saw your comment about 4x4 matrices. A 4x4 array of int
fits in a single cache line (on modern x86 CPUs, where a cache line is 64B). In that case, you want the compiler to generate something like
## matrix address in [rdi]
movups xmm0, [rdi]
movups xmm1, [rdi+16]
movups xmm2, [rdi+32]
movups xmm3, [rdi+48]
movups [rdi], xmm1 ; doing all the stores after all the loads avoids any possible false dependency
movups [rdi+16], xmm2
movups [rdi+32], xmm3
movups [rdi+48], xmm0
Or maybe fewer AVX 256b loads/stores, but unaligned AVX might do worse. If the array is 64B-aligned, then none of the required loads/stores would cross cache line boundaries. So 2x vmovups ymm
loads, and one vmovups ymm
store, one vmovups xmm
store (to the end), and one vextractf128
store (to the start).
If you're lucky, John's memcpy will optimize away to something like this when the function inlines into a caller that has compile-time-constant values of 4
.
For tiny arrays, it's not cache-misses that's the issue, just how to get the whole copy to happen with as little overhead as possible. My ideas below about introducing a level of indirection are not a good idea, because it's really cheap to load all the data and store it back out.
If you leave room at the end of your matrix for another row, you can just copy the first row to this extra space, and pass a pointer to what was the 2nd row.
This lets you temporarily have a different view of a matrix, but it's not a repeatable process.
If you had a big buffer, you could go on rotating the matrix rows this way until you get to the end of the reserved space and have to copy the array back to the top of the buffer. This minimizes copying overhead but does mean that you're touching some new memory.
If row-copying overhead is a big deal, introducing a level of indirection might be a good idea. Depending on the access pattern of the code that uses it after you shuffle the rows, this might be worse. Instead of a normal 2D array, this might be a use case for an array of pointers to rows.
You can and should allocate the storage for the matrix with one big allocation, instead of allocating each row separately. A C++ std::vector
of vectors is not ideal. Initializing your int *rows[M]
just takes a loop of &A[i][0]
, so it's just math, not multiple loads or allocations.
Accessing the array through this indirection table replaces i*N + j
math with pointer-chasing: load rows[i]
, then index that with j
.
When you don't need the shuffled view of the array, you can access it directly, but if you want to be able to make permanent shuffles to the array, all users of it always have to go through the indirection layer.
Upvotes: 1
Reputation: 249093
/* M is the number of rows; N is the number of columns. */
void matrix_shift(int M, int N, int A[M][N]) {
size_t rowbytes = N * sizeof(int);
int temprow[N];
memcpy(temprow, A, rowbytes); // store first row
memmove(A, A + 1, (M-1) * rowbytes); // shift up
memcpy(A + (M-1), temprow, rowbytes); // replace last row
}
This keeps it simple and relies on routines which should be highly optimized on any common platform. There is one extra row copied, but this is a minor inefficiency in the stated case of a square matrix.
Upvotes: 2