How to merge two YMM registers into a single ZMM register with interleaved 128-bit halves?

Question

I have two YMM registers holding v20 = { b128, a128 } and v31 = { d128, c128 } (written high half first). I need to write them to memory in the following order: a128, c128, b128, d128.

I wrote code that seems to work, but it is not as fast as I want:

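// v20 = { b128, a128 }, v31 = { d128, c128 }
v10 = _mm256_permute2f128_si256(v20, v31, 0x20); // { c128, a128 }: low halves of v20 and v31
v32 = _mm256_permute2f128_si256(v20, v31, 0x31); // { d128, b128 }: high halves of v20 and v31

_mm256_store_si256((__m256i*)bufptr  , v10);     // aligned store: bufptr must be 32-byte aligned
_mm256_store_si256((__m256i*)bufptr+1, v32);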
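
To double-check the two immediates, here is a small portable C model of how the selector bits in the control byte pick 128-bit halves. It is only a sketch: `lane128`, `vec256`, `permute2x128_model` and `store_interleaved_model` are made-up names for illustration, not intrinsics, and the model ignores everything about timing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for a 128-bit half and a 256-bit register. */
typedef struct { uint8_t bytes[16]; } lane128;
typedef struct { lane128 lane[2]; } vec256;   /* lane[0] = low half, lane[1] = high half */

/* Scalar model of the 128-bit lane selection done by vperm2f128.
 * Selector values: 0 = a.low, 1 = a.high, 2 = b.low, 3 = b.high.
 * imm8 bits 1:0 choose the low result half, bits 5:4 the high one;
 * bit 3 / bit 7 zero the low / high result half instead. */
static vec256 permute2x128_model(vec256 a, vec256 b, int imm8)
{
    const lane128 src[4] = { a.lane[0], a.lane[1], b.lane[0], b.lane[1] };
    vec256 r;
    r.lane[0] = src[imm8 & 3];
    r.lane[1] = src[(imm8 >> 4) & 3];
    if (imm8 & 0x08) memset(&r.lane[0], 0, sizeof r.lane[0]);
    if (imm8 & 0x80) memset(&r.lane[1], 0, sizeof r.lane[1]);
    return r;
}

/* Writes both permuted vectors to out (64 bytes), like the two stores above. */
static void store_interleaved_model(uint8_t *out, vec256 v20, vec256 v31)
{
    vec256 v10 = permute2x128_model(v20, v31, 0x20);  /* { c128, a128 } */
    vec256 v32 = permute2x128_model(v20, v31, 0x31);  /* { d128, b128 } */
    memcpy(out,      &v10, 32);
    memcpy(out + 32, &v32, 32);
}
```

With a128 in the low half of v20 and c128 in the low half of v31, the 64 output bytes come out as a128, c128, b128, d128, which is the order I need.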

The idea is to interleave the halves of the two YMM registers into a single ZMM register and then write that ZMM register to memory.

The problem is that I cannot find intrinsics that do this. How can I do it, and what is the best way? Would it actually give any performance benefit?
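
A minimal sketch of one possible shape for this, assuming AVX-512F and a GCC/Clang-style compiler (for the `target` attribute and `__builtin_cpu_supports`). The function names and the pointer-based interface are made up for illustration, and the scalar fallback exists only so the sketch also runs on CPUs without AVX-512F. Whether this beats the two 256-bit stores would have to be measured.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* p20 points at v20 = { b128, a128 }, p31 at v31 = { d128, c128 }
 * (high half first); dst receives 64 bytes in the order a, c, b, d. */
__attribute__((target("avx512f")))
static void store_interleaved_zmm(void *dst, const void *p20, const void *p31)
{
    __m256i v20 = _mm256_loadu_si256((const __m256i *)p20);
    __m256i v31 = _mm256_loadu_si256((const __m256i *)p31);

    /* Widen v20 (upper 256 bits undefined), then overwrite the upper
     * half with v31: the 128-bit lanes 0..3 now hold a, b, c, d. */
    __m512i z = _mm512_inserti64x4(_mm512_castsi256_si512(v20), v31, 1);

    /* Pick lanes 0, 2, 1, 3 of z, giving a, c, b, d. */
    z = _mm512_shuffle_i64x2(z, z, _MM_SHUFFLE(3, 1, 2, 0));

    _mm512_storeu_si512(dst, z);   /* unaligned store, no alignment assumed */
}

/* Hypothetical wrapper: uses the ZMM path when the CPU reports AVX-512F,
 * otherwise copies the four 16-byte halves one by one. */
static void store_interleaved(void *dst, const void *p20, const void *p31)
{
    if (__builtin_cpu_supports("avx512f")) {
        store_interleaved_zmm(dst, p20, p31);
        return;
    }
    uint8_t *out = (uint8_t *)dst;
    const uint8_t *s20 = (const uint8_t *)p20;
    const uint8_t *s31 = (const uint8_t *)p31;
    memcpy(out,      s20,      16);  /* a128 */
    memcpy(out + 16, s31,      16);  /* c128 */
    memcpy(out + 32, s20 + 16, 16);  /* b128 */
    memcpy(out + 48, s31 + 16, 16);  /* d128 */
}
```

This version costs two lane-crossing shuffles (the insert and the 128-bit lane shuffle) plus one 512-bit store, compared with two permutes and two 256-bit stores in my current code.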

Another issue: I am migrating from XMM code to YMM code, but the performance gain so far is very small, and this confuses me.

Answers (0)
