user6137121

Why does gcc compile _mm256_permute2f128_ps to a vinsertf128 instruction?

This instruction is part of the assembly output of a C program (gcc -O2). From the result, I understand that ymm6 is source operand 1: all of it is copied to ymm9, and then xmm1 is copied into the upper 128 bits of ymm9 (bits 255:128). I read the Intel manual, but it uses Intel assembly syntax, not AT&T, and I don't want to use Intel syntax. So ymm8, ymm2 and ymm6 here is SRC1. is this true?

vshufps     $68,  %ymm0, %ymm8, %ymm6
vshufps     $68,  %ymm4, %ymm2, %ymm1
vinsertf128 $1,   %xmm1, %ymm6, %ymm9

The main question is why gcc changed the instruction

row0 = _mm256_permute2f128_ps(__tt0, __tt4, 0x20);

to

vinsertf128 $1,  %xmm1, %ymm6, %ymm9

and

row4 = _mm256_permute2f128_ps(__tt0, __tt4, 0x31);

to

vperm2f128  $49, %ymm1, %ymm6, %ymm1

How can I disable this optimization? I tried -O0, but it doesn't work.

Upvotes: 1

Views: 933

Answers (2)

ADMS

Reputation: 108

From the IACA user guide: "Examination of the instructions that bind to port 5 in the instruction analysis report shows that the instructions were broadcasts and vpermilps. The broadcasts can only execute on port 5, but replacing them with 128-bit loads followed by vinsertf128 instructions reduces the pressure on port 5 because vinsertf128 can execute on port 0."

Upvotes: 1

Peter Cordes

Reputation: 363980

So ymm8, ymm2 and ymm6 here is SRC1. is this true?

Yes, the middle operand is always src1 in a 3-operand instruction in both syntaxes.

  • AT&T: op %src2, %src1, %dest
  • Intel: op dest, src1, src2
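The mental conversion between the two orderings can be sketched in a few lines of C. This is only an illustration: att_to_intel is a hypothetical helper (not part of any toolchain), and it assumes the operands are plain registers and immediates, so memory operands and instructions with special-case operand rules are out of scope.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: rewrite "op $imm, %src2, %src1, %dest" (AT&T) as
   "op dest, src1, src2, imm" (Intel). Only register and immediate operands
   are handled. Returns 0 on success, -1 on bad input or a too-small buffer. */
static int att_to_intel(const char *att, char *out, size_t outsz)
{
    char buf[128];
    char *ops[8];
    int n = 0;

    assert(att != NULL && out != NULL);
    if (strlen(att) >= sizeof buf)
        return -1;
    strcpy(buf, att);

    /* Split the mnemonic from the operand list at the first blank. */
    size_t m = strcspn(buf, " \t");
    if (buf[m] == '\0')
        return -1;
    buf[m] = '\0';

    /* Tokenize the operands, dropping the '$' and '%' decorators. */
    for (char *tok = strtok(buf + m + 1, ", \t"); tok != NULL && n < 8;
         tok = strtok(NULL, ", \t")) {
        if (*tok == '$' || *tok == '%')
            tok++;
        ops[n++] = tok;
    }
    if (n == 0)
        return -1;

    /* Emit the mnemonic, then the operands in reverse order. */
    size_t len = (size_t)snprintf(out, outsz, "%s ", buf);
    for (int i = n - 1; i >= 0 && len < outsz; i--)
        len += (size_t)snprintf(out + len, outsz - len, "%s%s",
                                ops[i], i ? ", " : "");
    return len < outsz ? 0 : -1;
}
```

Feeding it the line from the question, "vinsertf128 $1, %xmm1, %ymm6, %ymm9", gives "vinsertf128 ymm9, ymm6, xmm1, 1", which is the form the Intel manual describes.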

I don't want to use Intel syntax

Tough. The only really good documentation I know of for exactly what every instruction does is the Intel insn ref manual. I used to think AT&T syntax was better, because the $ and % decorators remove ambiguity. I do like that, but otherwise prefer the Intel syntax now. The rules for each are simple enough that you can easily mentally convert, or "think" in whichever one you're reading ATM.

Unless you're actually writing GNU C inline asm, you can just use gcc -masm=intel and objdump -Mintel to get GNU-flavoured asm using intel mnemonics, operand order, and so on. The assembler directives are still gas style, not NASM. Use http://gcc.godbolt.org/ to get nicely-formatted asm output for code with only the essential labels left in.


gcc and clang both have some understanding of what the intrinsics actually do, so internally they translate the intrinsic to some data movement. When it comes time to emit code, they see that said data movement can be done with vinsertf128, so they emit that.
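To see why the two instructions are interchangeable for imm8 = 0x20, here is a scalar model of the data movement in plain C. It is a sketch, not how the compilers represent it internally: v8f, select_lane, model_vperm2f128 and model_vinsertf128 are names made up for this illustration, and the selector decoding follows the instruction description in the Intel manual.

```c
#include <assert.h>
#include <string.h>

/* Scalar stand-in for a 256-bit register of 8 floats:
   f[0..3] is the low 128-bit lane, f[4..7] the high lane. */
typedef struct { float f[8]; } v8f;

/* Pick the lane named by one 4-bit vperm2f128 selector field.
   Bit 1 chooses a or b, bit 0 chooses low or high, bit 3 zeroes the lane. */
static void select_lane(float out[4], const v8f *a, const v8f *b, unsigned sel)
{
    const v8f *src = (sel & 2) ? b : a;
    if (sel & 8)
        memset(out, 0, 4 * sizeof(float));
    else
        memcpy(out, &src->f[(sel & 1) ? 4 : 0], 4 * sizeof(float));
}

/* Model of _mm256_permute2f128_ps(a, b, imm). */
static v8f model_vperm2f128(v8f a, v8f b, unsigned imm)
{
    v8f r;
    select_lane(&r.f[0], &a, &b, imm & 0xF);         /* low lane  */
    select_lane(&r.f[4], &a, &b, (imm >> 4) & 0xF);  /* high lane */
    return r;
}

/* Model of vinsertf128: copy a, then overwrite lane (imm & 1) with x. */
static v8f model_vinsertf128(v8f a, const float x[4], unsigned imm)
{
    v8f r = a;
    memcpy(&r.f[(imm & 1) ? 4 : 0], x, 4 * sizeof(float));
    return r;
}

/* imm 0x20 means { a.low, b.low }, which is exactly "insert the low lane
   of b into the high lane of a", i.e. vinsertf128 with imm 1. */
static void check_imm20_equals_insert(void)
{
    v8f a = {{0, 1, 2, 3, 4, 5, 6, 7}};
    v8f b = {{10, 11, 12, 13, 14, 15, 16, 17}};
    v8f p = model_vperm2f128(a, b, 0x20);
    v8f i = model_vinsertf128(a, &b.f[0], 1);
    assert(memcmp(&p, &i, sizeof p) == 0);
}
```

With imm8 = 0x31 the result is { a.high, b.high }. No lane of a stays in place, so a single vinsertf128 can't produce it, which is presumably why the vperm2f128 survives in that case.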

On some CPUs (Intel SnB-family), both instructions have equal performance, but on AMD Bulldozer-family (which only has 128-bit ALUs), vinsertf128 is much faster than vperm2f128. (Source: see Agner Fog's guides, and other links at the x86 tag wiki.) They both take 6 bytes to encode, including the immediate, so there's no code-size difference. vinsertf128 is always a better choice than a vperm2f128 that does identical data movement.

gcc and clang don't have a "literal translation of intrinsics to instructions" mode, because it would take extra work to implement. If you care exactly which instructions the compiler uses, that's what inline asm is for.
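A minimal sketch of what that could look like with GNU C inline asm, assuming AT&T syntax output (the template would need changing under -masm=intel) and a build with AVX enabled (e.g. -mavx). permute2f128_0x20 is a hypothetical wrapper name, and the plain-C fallback exists only so the snippet still builds when __AVX__ is not defined.

```c
#include <assert.h>
#include <string.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical wrapper: out = { a[0..3], b[0..3] }, the imm8 = 0x20 case. */
static void permute2f128_0x20(float out[8], const float a[8], const float b[8])
{
    assert(out != NULL && a != NULL && b != NULL);
#if defined(__AVX__)
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vr;
    /* AT&T operand order: immediate, src2, src1, dest.
       The compiler pastes this template in as written, so the emitted
       instruction really is vperm2f128. */
    __asm__ ("vperm2f128 %3, %2, %1, %0"
             : "=x"(vr)
             : "x"(va), "x"(vb), "i"(0x20));
    _mm256_storeu_ps(out, vr);
#else
    /* Fallback without AVX: same data movement in plain C. */
    memcpy(out, a, 4 * sizeof(float));
    memcpy(out + 4, b, 4 * sizeof(float));
#endif
}
```

The trade-off is that the compiler can no longer see through the asm statement, so it can't fold it with neighbouring shuffles or pick a cheaper instruction for the target CPU.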

Keep in mind that -O0 doesn't mean "no optimization". The compiler still has to transform your code through a couple of internal representations before emitting asm.

Upvotes: 4
