Reputation: 133
I have two YMM registers holding v20 = { b128, a128 } and v31 = { d128, c128 }. I need to write them to memory in the following order: a128, c128, b128, d128.
I wrote code that seems to work, but it is not as fast as I would like:
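To make the required layout concrete, here is a minimal scalar sketch of the result I am after. It assumes each register is viewed as 32 bytes in memory order, with the low 128-bit lane first (so v20 is a128 followed by b128). The names `store_interleaved_ref` and `LANE` are made up for illustration only and are not part of my real code.

```c
#include <stdint.h>
#include <string.h>

enum { LANE = 16 }; /* bytes in one 128-bit lane */

/* Scalar reference for the wanted output order: a128, c128, b128, d128.
 * v20 and v31 each point to 32 bytes: low lane first, then high lane.
 * dst must have room for 64 bytes. */
static void store_interleaved_ref(uint8_t *dst,
                                  const uint8_t *v20,
                                  const uint8_t *v31)
{
    memcpy(dst + 0 * LANE, v20,        LANE); /* a128: low lane of v20  */
    memcpy(dst + 1 * LANE, v31,        LANE); /* c128: low lane of v31  */
    memcpy(dst + 2 * LANE, v20 + LANE, LANE); /* b128: high lane of v20 */
    memcpy(dst + 3 * LANE, v31 + LANE, LANE); /* d128: high lane of v31 */
}
```

Any vectorized version should produce the same 64 bytes as this reference.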
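// v10 = { c128, a128 }: low lanes of v20 and v31
v10 = _mm256_permute2f128_si256(v20, v31, 0x20);
// v32 = { d128, b128 }: high lanes of v20 and v31
v32 = _mm256_permute2f128_si256(v20, v31, 0x31);
_mm256_store_si256((__m256i*)bufptr,     v10);
_mm256_store_si256((__m256i*)bufptr + 1, v32);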
My idea is to interleave the halves of the two YMM registers into a single ZMM register and then write that ZMM register to memory with one store.
The problem is that I cannot find intrinsics that do this. How can I do it, and what is the best way? Would it actually give any performance benefit?
Another issue: I am migrating from XMM code to YMM code, but the performance gain I have seen so far is very small, and this confuses me.
Upvotes: 0
Views: 43