Reputation: 21615
I am learning to program with AVX, so I wrote a simple program that multiplies 4x4 matrices. With no compiler optimizations, the AVX version is slightly faster than the non-AVX version, but with -O3 the non-AVX version becomes almost twice as fast as the AVX version. Any tips on how I can improve the performance of the AVX version? The full code follows.
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#define MAT_SIZE 4
#define USE_AVX
double A[MAT_SIZE][MAT_SIZE];
double B[MAT_SIZE][MAT_SIZE];
double C[MAT_SIZE][MAT_SIZE];
union {
    double m[4][4];
    __m256d row[4];
} matB;
void init_matrices()
{
    for(int i = 0; i < MAT_SIZE; i++)
        for(int j = 0; j < MAT_SIZE; j++)
        {
            A[i][j] = (double)(i+j);
            B[i][j] = (double)(i+j+1);
            matB.m[i][j] = B[i][j];
        }
}
void print_result()
{
    for(int i = 0; i < MAT_SIZE; i++)
    {
        for(int j = 0; j < MAT_SIZE; j++)
        {
            printf("%.1f\t", C[i][j]);
        }
        printf("\n");
    }
}
void withoutAVX()
{
    for(int row = 0; row < MAT_SIZE; row++)
        for(int col = 0; col < MAT_SIZE; col++)
        {
            double sum = 0;   // accumulate in double to match the element type
            for(int e = 0; e < MAT_SIZE; e++)
                sum += A[row][e] * B[e][col];
            C[row][col] = sum;
        }
}
void withAVX()
{
    for(int row = 0; row < 4; row++)
    {
        //calculate_resultant_row(row);
        const double* rowA = (const double*)&A[row];
        __m256d* pr = (__m256d*)(&C[row]);
        *pr = _mm256_mul_pd(_mm256_broadcast_sd(&rowA[0]), matB.row[0]);
        for(int i = 1; i < 4; i++)
            *pr = _mm256_add_pd(*pr, _mm256_mul_pd(_mm256_broadcast_sd(&rowA[i]),
                                                   matB.row[i]));
    }
}
// read the CPU time-stamp counter
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
int main()
{
    init_matrices();
    // start timer
    unsigned long long cycles = rdtsc();
#ifdef USE_AVX
    withAVX();
#else
    withoutAVX();
#endif
    // stop timer
    cycles = rdtsc() - cycles;
    printf("\nTotal time elapsed : %llu\n\n", cycles);
    print_result();
    return 0;
}
Upvotes: 0
Views: 781
Reputation: 2767
It's hard to be sure without knowing exactly which compiler and system you are using; you need to check the assembly of the generated code to be certain. Below are merely some possible reasons.
The compiler probably generated extra loads and stores. These cost cycles.
The innermost loop broadcasts elements from A, so you have extra loads. Optimal code should require only 8 loads (4 each for A and B) and 4 stores back to C. However, your code performs at least 16 extra loads because of the use of broadcastsd. These will cost you as much as the computation itself, and probably more.
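For illustration, one way to get close to those 8 loads and 4 stores is to load each row of A once and broadcast its lanes in registers instead of reloading them from memory. The sketch below is untested and the function name is made up; it reuses the question's globals A, B and C, and it assumes an AVX2-capable CPU (for _mm256_permute4x64_pd) and FMA support (for _mm256_fmadd_pd), compiled with something like -O3 -mavx2 -mfma:
/* Sketch only: assumes the question's #include <immintrin.h> and global arrays. */
static void withAVX2_sketch(void)
{
    /* Load each row of B once: 4 loads total. */
    const __m256d b0 = _mm256_loadu_pd(B[0]);
    const __m256d b1 = _mm256_loadu_pd(B[1]);
    const __m256d b2 = _mm256_loadu_pd(B[2]);
    const __m256d b3 = _mm256_loadu_pd(B[3]);
    for(int row = 0; row < 4; row++)
    {
        /* One load per row of A (4 total), then broadcast its lanes in registers. */
        const __m256d a  = _mm256_loadu_pd(A[row]);
        const __m256d a0 = _mm256_permute4x64_pd(a, _MM_SHUFFLE(0, 0, 0, 0));
        const __m256d a1 = _mm256_permute4x64_pd(a, _MM_SHUFFLE(1, 1, 1, 1));
        const __m256d a2 = _mm256_permute4x64_pd(a, _MM_SHUFFLE(2, 2, 2, 2));
        const __m256d a3 = _mm256_permute4x64_pd(a, _MM_SHUFFLE(3, 3, 3, 3));
        /* Keep the running sum in a register; C is written only once per row. */
        __m256d acc = _mm256_mul_pd(a0, b0);
        acc = _mm256_fmadd_pd(a1, b1, acc);
        acc = _mm256_fmadd_pd(a2, b2, acc);
        acc = _mm256_fmadd_pd(a3, b3, acc);
        _mm256_storeu_pd(C[row], acc);   /* 1 store per row, 4 total. */
    }
}
Keeping the accumulator in a local register also means the compiler is not forced to store and reload *pr on every iteration of the inner loop.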
Edit (too long for comments)
There are situations where the compiler won't be able to do smart optimization, or is sometimes "too clever" for its own good. Recently I even needed to use assembly to avoid a compiler optimization that actually led to bad code! That said, if what you need is performance and you don't really care how you get there, I would suggest you first look for good libraries. For example, Eigen for linear algebra would fit your needs in this example perfectly.
If you do want to learn SIMD programming, I suggest you start with simpler cases, such as adding two vectors (a minimal example is sketched below). Most likely, you will find that the compiler can generate better vectorized binaries than your first few attempts, but these cases are more straightforward, so you will see where you need to improve more easily. In the process of trying to produce code as good as or better than what a compiler can generate, you will learn all the things you need to write optimal code, and eventually you will be able to provide optimal implementations for code that the compiler cannot optimize.
One thing to keep in mind is that the lower level you go, the less the compiler can do for you. You will have more control over which instructions are generated, but it is also your responsibility to make them optimal. This advice is pretty vague; sorry I cannot be of more help.
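As a starting point for that, a vector add is about the smallest useful SIMD kernel. The sketch below is untested, the names are made up for illustration, and it assumes AVX is available and that n is a multiple of 4:
#include <immintrin.h>
/* out[i] = a[i] + b[i], processing 4 doubles per iteration (sketch only). */
void add_vectors(const double *a, const double *b, double *out, int n)
{
    for(int i = 0; i < n; i += 4)
    {
        __m256d va = _mm256_loadu_pd(a + i);              /* load 4 doubles from a */
        __m256d vb = _mm256_loadu_pd(b + i);              /* load 4 doubles from b */
        _mm256_storeu_pd(out + i, _mm256_add_pd(va, vb)); /* store the 4 sums */
    }
}
Comparing the assembly of this against what the compiler emits for the plain scalar loop at -O3 is a quick way to see whether your intrinsics are actually buying you anything.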
Upvotes: 3