Optimization of matrix and vector multiplication in C

Question

I have a function that gets a 3 x 3 matrix and a 3 x 4000 vector, and multiplies them. All the calculation are done in double precision (64-bit). The function is called about 3.5 million times so it should be optimized.

#define MATRIX_DIM 3
#define VECTOR_LEN 3000

typedef struct {
    double a;
    double b;
    double c;
} vector_st;

double matrix[MATRIX_DIM][MATRIX_DIM];
vector_st vector[VACTOR_LEN];

inline void rotate_arr(double input_matrix[][MATRIX_DIM], vector_st *input_vector, vector_st *output_vector)
{
    int i;
    for (i = 0; i < VACTOR_LEN; i++) {
        op_rotate_preset_arr[i].a = input_matrix[0][0] * input_vector[i].a + 
                                    input_matrix[0][1] * input_vector[i].b +
                                    input_matrix[0][2] * input_vector[i].c;

        op_rotate_preset_arr[i].b = input_matrix[1][0] * input_vector[i].a + 
                                    input_matrix[1][1] * input_vector[i].b +
                                    input_matrix[1][2] * input_vector[i].c;

        op_rotate_preset_arr[i].c = input_matrix[2][0] * input_vector[i].a + 
                                    input_matrix[2][1] * input_vector[i].b +
                                    input_matrix[2][2] * input_vector[i].c;
    }
}

I all out of ideas on how to optimize it because it's inline, data access is sequential and the function is short and pretty straight-forward. It can be assumed that the vector is always the same and only the matrix is changing if it will boost performance.

user555045 · Accepted Answer

One easy to fix problem here is that compilers assumes that the matrix and the output vectors may alias. As seen here in the second function, that causes code to be generated that is less efficient and significantly larger. This can be fixed simply by adding restrict to the output pointer. Doing only this already helps and keeps the code free from platform specific optimization, but relies on auto-vectorization in order to use the performance increases that have happened in the past two decades.

Auto-vectorization is evidently still too immature for the task, both Clang and GCC generate way too much shuffling around of the data. This should improve in future compilers, but for now even a case like this (that doesn't seem inherently super hard) needs manual help, such as this (not tested though)

void rotate_arr_avx(double input_matrix[][MATRIX_DIM], vector_st *input_vector, vector_st * restrict output_vector)
{
    __m256d col0, col1, col2, a, b, c, t;
    int i;
    // using set macros like this is kind of dirty, but it's outside the loop anyway
    col0 = _mm256_set_pd(0.0, input_matrix[2][0], input_matrix[1][0], input_matrix[0][0]);
    col1 = _mm256_set_pd(0.0, input_matrix[2][1], input_matrix[1][1], input_matrix[0][1]);
    col2 = _mm256_set_pd(0.0, input_matrix[2][2], input_matrix[1][2], input_matrix[0][2]);
    for (i = 0; i < VECTOR_LEN; i++) {
        a = _mm256_set1_pd(input_vector[i].a);
        b = _mm256_set1_pd(input_vector[i].b);
        c = _mm256_set1_pd(input_vector[i].c);
        t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0, a), _mm256_mul_pd(col1, b)), _mm256_mul_pd(col2, c));
        // this stores an element too much, ensure 8 bytes of padding exist after the array
        _mm256_storeu_pd(&output_vector[i].a, t);
    }
}

Writing it this way significantly improves what compilers do with it, now compiling to a nice and tight loop without all the nonsense. Earlier the code hurt to look at, but with this the loop now looks like this (GCC 8.1, with FMA enabled), which is actually readable:

.L2:
    vbroadcastsd    ymm2, QWORD PTR [rsi+8+rax]
    vbroadcastsd    ymm1, QWORD PTR [rsi+16+rax]
    vbroadcastsd    ymm0, QWORD PTR [rsi+rax]
    vmulpd  ymm2, ymm2, ymm4
    vfmadd132pd     ymm1, ymm2, ymm3
    vfmadd132pd     ymm0, ymm1, ymm5
    vmovupd YMMWORD PTR [rdx+rax], ymm0
    add     rax, 24
    cmp     rax, 72000
    jne     .L2

This has an obvious deficiency: only 3 of the 4 double precision slots of the 256bit AVX vectors are actually used. If the data format of the vector was changed to for example AAAABBBBCCCC repeating, a totally different approach could be used, namely broadcasting the matrix elements instead of the vector elements, then multiplying the broadcasted matrix element by the A component of 4 different vector_sts at once.

An other thing we can try, without even changing the data format, is processing more than one matrix at the same time, which helps to re-use loads from the input_vector to increase arithmetic intensity.

void rotate_arr_avx(double input_matrixA[][MATRIX_DIM], double input_matrixB[][MATRIX_DIM], vector_st *input_vector, vector_st * restrict output_vectorA, vector_st * restrict output_vectorB)
{
    __m256d col0A, col1A, col2A, a, b, c, t, col0B, col1B, col2B;
    int i;
    // using set macros like this is kind of dirty, but it's outside the loop anyway
    col0A = _mm256_set_pd(0.0, input_matrixA[2][0], input_matrixA[1][0], input_matrixA[0][0]);
    col1A = _mm256_set_pd(0.0, input_matrixA[2][1], input_matrixA[1][1], input_matrixA[0][1]);
    col2A = _mm256_set_pd(0.0, input_matrixA[2][2], input_matrixA[1][2], input_matrixA[0][2]);
    col0B = _mm256_set_pd(0.0, input_matrixB[2][0], input_matrixB[1][0], input_matrixB[0][0]);
    col1B = _mm256_set_pd(0.0, input_matrixB[2][1], input_matrixB[1][1], input_matrixB[0][1]);
    col2B = _mm256_set_pd(0.0, input_matrixB[2][2], input_matrixB[1][2], input_matrixB[0][2]);
    for (i = 0; i < VECTOR_LEN; i++) {
        a = _mm256_set1_pd(input_vector[i].a);
        b = _mm256_set1_pd(input_vector[i].b);
        c = _mm256_set1_pd(input_vector[i].c);
        t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0A, a), _mm256_mul_pd(col1A, b)), _mm256_mul_pd(col2A, c));
        // this stores an element too much, ensure 8 bytes of padding exist after the array
        _mm256_storeu_pd(&output_vectorA[i].a, t);
        t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0B, a), _mm256_mul_pd(col1B, b)), _mm256_mul_pd(col2B, c));
        _mm256_storeu_pd(&output_vectorB[i].a, t);
    }
}

Optimization of matrix and vector multiplication in C

Answers (1)

Related Questions