This instruction is part of the assembly output of a C program (gcc -O2). From the result, I understand that ymm6 is source operand 1: all of it is copied to ymm9, and then xmm1 is copied into bits [255:128] of ymm9. I read the Intel manual, but it uses Intel assembly syntax rather than AT&T, and I don't want to use Intel syntax. So ymm8, ymm2 and ymm6 here are SRC1. Is this true?
vshufps $68, %ymm0, %ymm8, %ymm6
vshufps $68, %ymm4, %ymm2, %ymm1
vinsertf128 $1, %xmm1, %ymm6, %ymm9
The main question is why gcc changed the instruction
row0 = _mm256_permute2f128_ps(__tt0, __tt4, 0x20);
to
vinsertf128 $1, %xmm1, %ymm6, %ymm9
and
row4 = _mm256_permute2f128_ps(__tt0, __tt4, 0x31);
to
vperm2f128 $49, %ymm1, %ymm6, %ymm1
How can I disable this optimization? I tried -O0, but it doesn't work.
Upvotes: 1
Views: 933
Reputation: 108
"Examination of the instructions that bind to port 5 in the instruction analysis report shows that the instructions were broadcasts and vpermilps. The broadcasts can only execute on port 5, but replacing them with 128-bit loads followed by vinsertf128 instructions reduces the pressure on port 5 because vinsertf128 can execute on port 0." (From the IACA user guide.)
Upvotes: 1
Reputation: 363980
So ymm8, ymm2 and ymm6 here are SRC1. Is this true?
Yes, the middle operand is always src1 in a 3-operand instruction in both syntaxes.
AT&T:  op %src2, %src1, %dest
Intel: op dest, src1, src2
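To make the operand roles concrete, here is a minimal scalar sketch of what a 256-bit vshufps does. The function name vshufps_model and the float-array representation of a ymm register are assumptions made for illustration; the parameters are named after the roles in the Intel manual rather than after any real API.

```c
/* Scalar model of a 256-bit vshufps (illustrative, not a real intrinsic).
   Each 128-bit lane (4 floats) is shuffled independently.
   AT&T:  vshufps $imm, %src2, %src1, %dest
   Intel: vshufps dest, src1, src2, imm */
static void vshufps_model(float dest[8], const float src1[8],
                          const float src2[8], unsigned imm)
{
    float out[8];                       /* temp, so dest may alias a source */
    for (int lane = 0; lane < 8; lane += 4) {
        /* The two low result elements of each lane come from src1 ... */
        out[lane + 0] = src1[lane + ((imm >> 0) & 3)];
        out[lane + 1] = src1[lane + ((imm >> 2) & 3)];
        /* ... and the two high result elements come from src2. */
        out[lane + 2] = src2[lane + ((imm >> 4) & 3)];
        out[lane + 3] = src2[lane + ((imm >> 6) & 3)];
    }
    for (int i = 0; i < 8; i++)
        dest[i] = out[i];
}
```

Under this model, vshufps $68, %ymm0, %ymm8, %ymm6 corresponds to vshufps_model(ymm6, ymm8, ymm0, 68): the middle AT&T operand (ymm8) is src1 and supplies elements 0 and 1 of each lane.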
I don't want to use Intel syntax
Tough. The only really good documentation I know of for exactly what every instruction does is Intel's instruction set reference manual. I used to think AT&T syntax was better, because the $ and % decorators remove ambiguity. I do like that, but otherwise prefer the Intel syntax now. The rules for each are simple enough that you can easily convert mentally, or "think" in whichever one you're reading at the moment.
Unless you're actually writing GNU C inline asm, you can just use gcc -masm=intel and objdump -Mintel to get GNU-flavoured asm using Intel mnemonics, operand order, and so on. The assembler directives are still gas style, not NASM. Use http://gcc.godbolt.org/ to get nicely formatted asm output for code with only the essential labels left in.
gcc and clang both have some understanding of what the intrinsics actually do, so internally they translate each intrinsic into a description of the data movement it performs. When it comes time to emit code, they see that this particular data movement can be done with vinsertf128, so they emit that.
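The equivalence the compiler exploits can be checked with a small scalar model. This is only a sketch: the function names and the float-array representation of registers are made up for illustration, and the selector decoding follows my reading of the Intel manual's description of vperm2f128 (bits 1:0 and 5:4 pick a 128-bit half, bits 3 and 7 zero it).

```c
#include <string.h>

/* Pick one 128-bit half (4 floats) according to a 4-bit selector field:
   0 = src1 low, 1 = src1 high, 2 = src2 low, 3 = src2 high,
   bit 3 set = zero the half. */
static void select_half(float out[4], const float src1[8],
                        const float src2[8], unsigned field)
{
    if (field & 8) {
        memset(out, 0, 4 * sizeof(float));
        return;
    }
    const float *src = (field & 2) ? src2 : src1;
    memcpy(out, src + ((field & 1) ? 4 : 0), 4 * sizeof(float));
}

/* Model of _mm256_permute2f128_ps(src1, src2, imm) / vperm2f128. */
static void vperm2f128_model(float dest[8], const float src1[8],
                             const float src2[8], unsigned imm)
{
    float out[8];
    select_half(out, src1, src2, imm & 0xF);            /* low half  */
    select_half(out + 4, src1, src2, (imm >> 4) & 0xF); /* high half */
    memcpy(dest, out, sizeof out);
}

/* Model of vinsertf128: copy src1, then overwrite one half with xmm. */
static void vinsertf128_model(float dest[8], const float src1[8],
                              const float xmm[4], unsigned imm)
{
    float out[8];
    memcpy(out, src1, sizeof out);
    memcpy(out + ((imm & 1) ? 4 : 0), xmm, 4 * sizeof(float));
    memcpy(dest, out, sizeof out);
}
```

In this model, vperm2f128_model(d, tt0, tt4, 0x20) and vinsertf128_model(d, tt0, tt4, 1) produce the same result: the low half of tt0 followed by the low half of tt4. With 0x31 the low half of the result comes from the high half of src1, which a single register-to-register vinsertf128 cannot express, and that is consistent with gcc keeping vperm2f128 for row4.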
On some CPUs (Intel SnB-family), both instructions have equal performance, but on AMD Bulldozer-family (which only has 128b ALUs), vinsertf128 is much faster than vperm2f128. (Source: see Agner Fog's guides, and other links at the x86 tag wiki.) They both take 6 bytes to encode, including the immediate, so there's no code-size difference. vinsertf128 is always a better choice than a vperm2f128 that does identical data movement.
gcc and clang don't have a "literal translation of intrinsics to instructions" mode, because it would take extra work to implement. If you care exactly which instructions the compiler uses, that's what inline asm is for.
Keep in mind that -O0 doesn't mean "no optimization". The compiler still has to transform the code through a couple of internal representations before emitting asm.
Upvotes: 4