Reputation: 413
So far, I've got the following code.
int actualizar_tablero(int *tablero, int *tab_aux, unsigned Y, unsigned X, unsigned YX) {
int count = 0;
// Ensure memory alignment
int *aligned_tablero = __builtin_assume_aligned(tablero, 16);
int *aligned_tab_aux = __builtin_assume_aligned(tab_aux, 16);
for (unsigned i = 0; i < Y; ++i) {
unsigned im1 = (i == 0) ? (Y - 1) : (i - 1);
unsigned ip1 = (i == Y - 1) ? 0 : (i + 1);
for (unsigned j = 0; j < X; ++j) {
unsigned jm1 = (j == 0) ? (X - 1) : (j - 1);
unsigned jp1 = (j == X - 1) ? 0 : (j + 1);
int vecinos = aligned_tablero[im1 * X + jm1] + aligned_tablero[im1 * X + j] +
aligned_tablero[im1 * X + jp1] + aligned_tablero[i * X + jm1] +
aligned_tablero[i * X + jp1] + aligned_tablero[ip1 * X + jm1] +
aligned_tablero[ip1 * X + j] + aligned_tablero[ip1 * X + jp1];
int cell = aligned_tablero[i * X + j];
int new_val = (cell && ((vecinos == 2) || (vecinos == 3))) || (!cell && (vecinos == 3));
aligned_tab_aux[i * X + j] = new_val;
count += (new_val != cell);
}
}
// Copy back to tablero
for (unsigned i = 0; i < YX; ++i) {
aligned_tablero[i] = aligned_tab_aux[i];
}
return count;
}
The total execution time is 29.3s. __builtin_assume_aligned
improves performance by about 0.6s (29.9s->29.3s) on the target CPU. Any small improvements count, even if it is just 1%, I'm trying to get the most out of this code as possible.
I've thought about using more vectorization and possible register reuse. I'm guessing the conditionals on im1
, ip1
, jm1
, jp1
might have something to do with the code not being properly vectorized. Note, there is already some vectorization going on, but, extremely minor:
3,93 │278: movdqu 0x0(%r13,%rdx,1),%xmm0
0,50 │ movups %xmm0,(%rax,%rdx,1)
1,04 │ add $0x10,%rdx
1,52 │ cmp 0x20(%rsp),%rdx
│ ↑ jne 278
(perf report output, I dont really even know if this part of the code actually belongs to the function mentioned, I'm very new to this. It seems likely because of the forced __builtin_assume_aligned
)
I thought of separating the edge conditions i=0
, j=0
, i=Y-1
, j=X-1
into their own loop, in order to avoid checking this comparisons over and over again on the loop. But before doing so, I would like to know if this strategy is actually going to cause any improvements, since it is going to take a while doing it manually.
(edit: its actually not 6 code blocks, but 3*3=9)
I'd love to know your opinion about doing this optimization, so I know if it is worth taking my time. Feel free to suggest any other optimizations for this code :)!
The code will always be ran on the same CPU, an Intel Xeon E5645.
Compiler flags are: -Ofast -march=native -ffast-math
, but more can be added if needed.
Upvotes: 0
Views: 103