Reputation: 23
I am trying to learn more about auto-vectorization in GCC. In my project I have to use GCC 4.8.5, and I have some loops that I can see are not vectorized. So I created a small example to experiment with and find out why they are not.
What I am interested in is why GCC does not vectorize the loop and how I can get it vectorized. Unfortunately, I am not very familiar with GCC's output messages.
a) I would expect that this loop would be vectorized as a trivial case
b) Is there anything trivial that I am missing?
Thank you all very much in advance ...
The small example is:
#include <iostream>
#include <vector>
using namespace std;
class test
{
public:
    test();
    ~test();
    void calc_test();
};
test::test()
{
}
test::~test()
{
}
void
test::calc_test(void)
{
    vector<int> ffs_psd(10000,5.0);
    vector<int> G_qh_sp(10000,1.0);
    vector<int> G_qv_sp(10000,3.0);
    vector<int> B_erm_qh(10000,50.0);
    vector<int> B_erm_qv(10000,2.0);
    for ( uint ang=0; ang < 6808; ang++)
    {
        ffs_psd[0] += (G_qh_sp[ang] * B_erm_qh[ang]) + (G_qv_sp[ang] * B_erm_qv[ang]);
    }
}
int main(int argc, char * argv[])
{
    test m_test;
    m_test.calc_test();
}
I compile it with GCC 4.8.5:
c++ -O3 -ftree-vectorize -fopt-info-vec-missed -ftree-vectorizer-verbose=5 -std=c++11 test.cpp
The output that I get from the compiler is:
test.cpp:34: note: ===vect_slp_analyze_bb===
test.cpp:34: note: === vect_analyze_data_refs ===
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: === vect_pattern_recog ===
test.cpp:34: note: vect_is_simple_use: operand _27
test.cpp:34: note: def_stmt: _27 = (long unsigned int) ang_212;
test.cpp:34: note: type of def: 3.
test.cpp:34: note: vect_is_simple_use: operand ang_212
test.cpp:34: note: def_stmt: ang_212 = PHI <ang_43(78), 0(76)>
test.cpp:34: note: type of def: 2.
test.cpp:34: note: vect_is_simple_use: operand 4
test.cpp:34: note: vect_recog_widen_mult_pattern: detected:
test.cpp:34: note: get vectype with 4 units of type uint
test.cpp:34: note: vectype: vector(4) unsigned int
test.cpp:34: note: get vectype with 2 units of type long unsigned int
test.cpp:34: note: vectype: vector(2) long unsigned int
test.cpp:34: note: patt_2 = ang_212 w* 4;
test.cpp:34: note: pattern recognized: patt_2 = ang_212 w* 4;
test.cpp:34: note: vect_is_simple_use: operand _29
test.cpp:34: note: def_stmt: _29 = *_67;
test.cpp:34: note: type of def: 3.
test.cpp:34: note: vect_is_simple_use: operand _34
test.cpp:34: note: def_stmt: _34 = *_69;
test.cpp:34: note: type of def: 3.
test.cpp:34: note: === vect_analyze_dependences ===
test.cpp:34: note: can't determine dependence between *_67 and MEM[(value_type &)__first_111]
test.cpp:34: note: can't determine dependence between *_68 and MEM[(value_type &)__first_111]
test.cpp:34: note: can't determine dependence between *_69 and MEM[(value_type &)__first_111]
test.cpp:34: note: can't determine dependence between *_70 and MEM[(value_type &)__first_111]
test.cpp:34: note: === vect_analyze_data_refs_alignment ===
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_125
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_153
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_139
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_167
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: can't force alignment of ref: MEM[(value_type &)__first_111]
test.cpp:34: note: === vect_analyze_data_ref_accesses ===
test.cpp:34: note: not consecutive access MEM[(value_type &)__first_111] = _41;
test.cpp:34: note: === vect_analyze_slp ===
test.cpp:34: note: Failed to SLP the basic block.
test.cpp:34: note: not vectorized: failed to find SLP opportunities in basic block.
EDIT: After Matt's answer below:
@Matt:
Thanks a lot for your answer. I did not know that the vector is not aligned. This information is very useful, because many people would take it for granted that a loop will be vectorized even if they use a vector as a container.
Unfortunately, even with your changes GCC still reports that the loop is not vectorized (with different messages this time):
test.cpp:47: note: misalign = 0 bytes of ref MEM[(value_type &)&ffs_psd]
test.cpp:47: note: not consecutive access _25 = MEM[(value_type &)&ffs_psd];
test.cpp:47: note: Failed to SLP the basic block.
test.cpp:47: note: not vectorized: failed to find SLP opportunities in basic block.
test.cpp:47: note: misalign = 0 bytes of ref MEM[(value_type &)&ffs_psd]
test.cpp:47: note: not consecutive access _25 = MEM[(value_type &)&ffs_psd];
test.cpp:47: note: Failed to SLP the basic block.
test.cpp:47: note: not vectorized: failed to find SLP opportunities in basic block.
The assembly output is (hopefully I copied the correct section, because my assembly knowledge is not very good):
.L16:
vmovdqa 40000(%rsp,%rax), %ymm1
vmovdqa 80000(%rsp,%rax), %ymm0
vpmulld 120000(%rsp,%rax), %ymm1, %ymm1
vpmulld 160000(%rsp,%rax), %ymm0, %ymm0
vpaddd %ymm0, %ymm1, %ymm0
vpaddd (%rsp,%rax), %ymm0, %ymm0
vmovdqa %ymm0, (%rsp,%rax)
addq $32, %rax
cmpq $27232, %rax
jne .L16
Upvotes: 1
Views: 648
Reputation: 2802
To use aligned vector instructions such as vmovdqa, the operands need to be aligned along the proper boundaries, for example with __attribute__((aligned(32))) or __attribute__((aligned(16))). The standard allocator for std::vector does not guarantee this alignment, even if the element class is aligned. For example, std::vector<__m64> A creates a vector of SIMD data types, but they may not be aligned because std::allocator doesn't align everything. In my opinion the simplest change is to use a std::array with __attribute__((aligned(32))):
#include <iostream>
#include <array>
using namespace std;
int main()
{
    array<int, 10000> ffs_psd __attribute__((aligned(32)));
    ffs_psd.fill(5);
    array<int, 10000> G_qh_sp __attribute__((aligned(32)));
    G_qh_sp.fill(1);
    array<int, 10000> G_qv_sp __attribute__((aligned(32)));
    G_qv_sp.fill(3);
    array<int, 10000> B_erm_qh __attribute__((aligned(32)));
    B_erm_qh.fill(50);
    array<int, 10000> B_erm_qv __attribute__((aligned(32)));
    B_erm_qv.fill(2);
    for ( uint ang=0; ang < 6808; ang++)
    {
        ffs_psd[0] += (G_qh_sp[ang] * B_erm_qh[ang]) + (G_qv_sp[ang] * B_erm_qv[ang]);
    }
    cout << ffs_psd[0] << endl;
}
The loop produces this:
vmovdqa ymm2, YMMWORD PTR [rsp+40000+rax]
vmovdqa ymm1, YMMWORD PTR [rsp+80000+rax]
vpmulld ymm2, ymm2, YMMWORD PTR [rsp+120000+rax]
vpmulld ymm1, ymm1, YMMWORD PTR [rsp+160000+rax]
add rax, 32
vpaddd ymm1, ymm2, ymm1
cmp rax, 27232
vpaddd ymm0, ymm0, ymm1
jne .L13
vmovdqa xmm1, xmm0
This was compiled on Godbolt with GCC 4.8.3 and the flags -std=c++11 -Wall -Wextra -pedantic-errors -O2 -ftree-vectorize -march=native.
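In the assembly above the running sum stays in a register (ymm0) for the whole loop and is only reduced afterwards. One way to ask for the same thing in source, sketched here as an illustration rather than a guaranteed fix, is to accumulate into a local variable instead of writing to ffs_psd[0] on every iteration, so the store cannot alias the arrays being read. The function name weighted_sum and its parameters are hypothetical and not part of the original code, and whether GCC 4.8.5 actually vectorizes it still has to be checked in the vectorizer report.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): computes the sum of
// a[i]*b[i] + c[i]*d[i] over the first n elements.
int weighted_sum(const std::vector<int>& a, const std::vector<int>& b,
                 const std::vector<int>& c, const std::vector<int>& d,
                 std::size_t n)
{
    // Precondition: every input holds at least n elements.
    assert(n <= a.size() && n <= b.size() && n <= c.size() && n <= d.size());

    int sum = 0;  // local accumulator: it cannot alias the input arrays
    for (std::size_t i = 0; i < n; ++i)
    {
        sum += a[i] * b[i] + c[i] * d[i];
    }
    return sum;
}
```

With the vectors from the question, the call would look like ffs_psd[0] += weighted_sum(G_qh_sp, B_erm_qh, G_qv_sp, B_erm_qv, 6808); so the single store to ffs_psd[0] happens after the loop.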
Another option is to use boost::alignment::aligned_allocator with your vector.
Finally, you can write your own allocator that vector can use to properly align things. Here is an article explaining the requirements for an allocator. Also, here is a SO question about the same basic thing.
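As a rough illustration of the "write your own allocator" option, here is a minimal C++11 sketch. The names aligned_allocator_sketch and is_aligned are made up for this example, and it assumes a POSIX system because it relies on posix_memalign. It has not been tried on GCC 4.8.5 specifically; older standard library versions may expect the full C++03 set of allocator typedefs and members.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // posix_memalign, free (POSIX)
#include <vector>

// Hypothetical helper: true if p is aligned to `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical minimal allocator that hands out memory aligned to
// `Alignment` bytes. Assumes POSIX (posix_memalign).
template <typename T, std::size_t Alignment = 32>
struct aligned_allocator_sketch
{
    static_assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0 &&
                  Alignment % sizeof(void*) == 0,
                  "Alignment must be a power of two and a multiple of sizeof(void*)");

    using value_type = T;

    aligned_allocator_sketch() noexcept {}
    template <typename U>
    aligned_allocator_sketch(const aligned_allocator_sketch<U, Alignment>&) noexcept {}

    // An explicit rebind is needed because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = aligned_allocator_sketch<U, Alignment>; };

    T* allocate(std::size_t n)
    {
        if (n > static_cast<std::size_t>(-1) / sizeof(T))
            throw std::bad_alloc();
        void* p = nullptr;
        if (posix_memalign(&p, Alignment, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        assert(is_aligned(p, Alignment));
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { free(p); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const aligned_allocator_sketch<T, A>&,
                const aligned_allocator_sketch<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const aligned_allocator_sketch<T, A>&,
                const aligned_allocator_sketch<U, A>&) noexcept { return false; }
```

It would be used as std::vector<int, aligned_allocator_sketch<int, 32>> ffs_psd(10000, 5); note that this only aligns the storage. The compiler still cannot see that alignment through the vector's pointer at compile time, so the vectorizer report may still mention unknown alignment.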
Upvotes: 1