quiquelhappy
quiquelhappy

Reputation: 413

C Optimization help needed for SIMD

So far, I've got the following code.

int actualizar_tablero(int *tablero, int *tab_aux, unsigned Y, unsigned X, unsigned YX) {
    int count = 0;

    // Ensure memory alignment
    int *aligned_tablero = __builtin_assume_aligned(tablero, 16);
    int *aligned_tab_aux = __builtin_assume_aligned(tab_aux, 16);

    for (unsigned i = 0; i < Y; ++i) {
        unsigned im1 = (i == 0) ? (Y - 1) : (i - 1);
        unsigned ip1 = (i == Y - 1) ? 0 : (i + 1);

        for (unsigned j = 0; j < X; ++j) {
            unsigned jm1 = (j == 0) ? (X - 1) : (j - 1);
            unsigned jp1 = (j == X - 1) ? 0 : (j + 1);

            int vecinos = aligned_tablero[im1 * X + jm1] + aligned_tablero[im1 * X + j] +
                          aligned_tablero[im1 * X + jp1] + aligned_tablero[i * X + jm1] +
                          aligned_tablero[i * X + jp1] + aligned_tablero[ip1 * X + jm1] +
                          aligned_tablero[ip1 * X + j] + aligned_tablero[ip1 * X + jp1];

            int cell = aligned_tablero[i * X + j];
            int new_val = (cell && ((vecinos == 2) || (vecinos == 3))) || (!cell && (vecinos == 3));
            aligned_tab_aux[i * X + j] = new_val;
            count += (new_val != cell);
        }
    }

    // Copy back to tablero
    for (unsigned i = 0; i < YX; ++i) {
        aligned_tablero[i] = aligned_tab_aux[i];
    }

    return count;
}

The total execution time is 29.3s. __builtin_assume_aligned improves performance by about 0.6s (29.9s->29.3s) on the target CPU. Any small improvements count, even if it is just 1%, I'm trying to get the most out of this code as possible.

I've thought about using more vectorization and possible register reuse. I'm guessing the conditionals on im1, ip1, jm1, jp1 might have something to do with the code not being properly vectorized. Note, there is already some vectorization going on, but, extremely minor:

  3,93 │278:   movdqu 0x0(%r13,%rdx,1),%xmm0
  0,50 │       movups %xmm0,(%rax,%rdx,1)
  1,04 │       add    $0x10,%rdx
  1,52 │       cmp    0x20(%rsp),%rdx
       │     ↑ jne    278

(perf report output, I dont really even know if this part of the code actually belongs to the function mentioned, I'm very new to this. It seems likely because of the forced __builtin_assume_aligned)

I thought of separating the edge conditions i=0, j=0, i=Y-1, j=X-1 into their own loop, in order to avoid checking this comparisons over and over again on the loop. But before doing so, I would like to know if this strategy is actually going to cause any improvements, since it is going to take a while doing it manually.

expected code blocks

(edit: its actually not 6 code blocks, but 3*3=9)

I'd love to know your opinion about doing this optimization, so I know if it is worth taking my time. Feel free to suggest any other optimizations for this code :)!

The code will always be ran on the same CPU, an Intel Xeon E5645.

Compiler flags are: -Ofast -march=native -ffast-math, but more can be added if needed.

Upvotes: 0

Views: 103

Answers (0)

Related Questions