Reputation: 463
I have a function that gets a 3 x 3
matrix and a 3 x 4000
vector
, and multiplies them.
All the calculation are done in double precision (64-bit).
The function is called about 3.5
million times so it should be optimized.
#define MATRIX_DIM 3
#define VECTOR_LEN 3000
typedef struct {
double a;
double b;
double c;
} vector_st;
double matrix[MATRIX_DIM][MATRIX_DIM];
vector_st vector[VACTOR_LEN];
inline void rotate_arr(double input_matrix[][MATRIX_DIM], vector_st *input_vector, vector_st *output_vector)
{
int i;
for (i = 0; i < VACTOR_LEN; i++) {
op_rotate_preset_arr[i].a = input_matrix[0][0] * input_vector[i].a +
input_matrix[0][1] * input_vector[i].b +
input_matrix[0][2] * input_vector[i].c;
op_rotate_preset_arr[i].b = input_matrix[1][0] * input_vector[i].a +
input_matrix[1][1] * input_vector[i].b +
input_matrix[1][2] * input_vector[i].c;
op_rotate_preset_arr[i].c = input_matrix[2][0] * input_vector[i].a +
input_matrix[2][1] * input_vector[i].b +
input_matrix[2][2] * input_vector[i].c;
}
}
I all out of ideas on how to optimize it because it's inline
, data access is sequential and the function is short and pretty straight-forward.
It can be assumed that the vector is always the same and only the matrix is changing if it will boost performance.
Upvotes: 1
Views: 636
Reputation: 64904
One easy to fix problem here is that compilers assumes that the matrix and the output vectors may alias. As seen here in the second function, that causes code to be generated that is less efficient and significantly larger. This can be fixed simply by adding restrict
to the output pointer. Doing only this already helps and keeps the code free from platform specific optimization, but relies on auto-vectorization in order to use the performance increases that have happened in the past two decades.
Auto-vectorization is evidently still too immature for the task, both Clang and GCC generate way too much shuffling around of the data. This should improve in future compilers, but for now even a case like this (that doesn't seem inherently super hard) needs manual help, such as this (not tested though)
void rotate_arr_avx(double input_matrix[][MATRIX_DIM], vector_st *input_vector, vector_st * restrict output_vector)
{
__m256d col0, col1, col2, a, b, c, t;
int i;
// using set macros like this is kind of dirty, but it's outside the loop anyway
col0 = _mm256_set_pd(0.0, input_matrix[2][0], input_matrix[1][0], input_matrix[0][0]);
col1 = _mm256_set_pd(0.0, input_matrix[2][1], input_matrix[1][1], input_matrix[0][1]);
col2 = _mm256_set_pd(0.0, input_matrix[2][2], input_matrix[1][2], input_matrix[0][2]);
for (i = 0; i < VECTOR_LEN; i++) {
a = _mm256_set1_pd(input_vector[i].a);
b = _mm256_set1_pd(input_vector[i].b);
c = _mm256_set1_pd(input_vector[i].c);
t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0, a), _mm256_mul_pd(col1, b)), _mm256_mul_pd(col2, c));
// this stores an element too much, ensure 8 bytes of padding exist after the array
_mm256_storeu_pd(&output_vector[i].a, t);
}
}
Writing it this way significantly improves what compilers do with it, now compiling to a nice and tight loop without all the nonsense. Earlier the code hurt to look at, but with this the loop now looks like this (GCC 8.1, with FMA enabled), which is actually readable:
.L2:
vbroadcastsd ymm2, QWORD PTR [rsi+8+rax]
vbroadcastsd ymm1, QWORD PTR [rsi+16+rax]
vbroadcastsd ymm0, QWORD PTR [rsi+rax]
vmulpd ymm2, ymm2, ymm4
vfmadd132pd ymm1, ymm2, ymm3
vfmadd132pd ymm0, ymm1, ymm5
vmovupd YMMWORD PTR [rdx+rax], ymm0
add rax, 24
cmp rax, 72000
jne .L2
This has an obvious deficiency: only 3 of the 4 double precision slots of the 256bit AVX vectors are actually used. If the data format of the vector was changed to for example AAAABBBBCCCC repeating, a totally different approach could be used, namely broadcasting the matrix elements instead of the vector elements, then multiplying the broadcasted matrix element by the A component of 4 different vector_st
s at once.
An other thing we can try, without even changing the data format, is processing more than one matrix at the same time, which helps to re-use loads from the input_vector
to increase arithmetic intensity.
void rotate_arr_avx(double input_matrixA[][MATRIX_DIM], double input_matrixB[][MATRIX_DIM], vector_st *input_vector, vector_st * restrict output_vectorA, vector_st * restrict output_vectorB)
{
__m256d col0A, col1A, col2A, a, b, c, t, col0B, col1B, col2B;
int i;
// using set macros like this is kind of dirty, but it's outside the loop anyway
col0A = _mm256_set_pd(0.0, input_matrixA[2][0], input_matrixA[1][0], input_matrixA[0][0]);
col1A = _mm256_set_pd(0.0, input_matrixA[2][1], input_matrixA[1][1], input_matrixA[0][1]);
col2A = _mm256_set_pd(0.0, input_matrixA[2][2], input_matrixA[1][2], input_matrixA[0][2]);
col0B = _mm256_set_pd(0.0, input_matrixB[2][0], input_matrixB[1][0], input_matrixB[0][0]);
col1B = _mm256_set_pd(0.0, input_matrixB[2][1], input_matrixB[1][1], input_matrixB[0][1]);
col2B = _mm256_set_pd(0.0, input_matrixB[2][2], input_matrixB[1][2], input_matrixB[0][2]);
for (i = 0; i < VECTOR_LEN; i++) {
a = _mm256_set1_pd(input_vector[i].a);
b = _mm256_set1_pd(input_vector[i].b);
c = _mm256_set1_pd(input_vector[i].c);
t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0A, a), _mm256_mul_pd(col1A, b)), _mm256_mul_pd(col2A, c));
// this stores an element too much, ensure 8 bytes of padding exist after the array
_mm256_storeu_pd(&output_vectorA[i].a, t);
t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0B, a), _mm256_mul_pd(col1B, b)), _mm256_mul_pd(col2B, c));
_mm256_storeu_pd(&output_vectorB[i].a, t);
}
}
Upvotes: 1