Reputation: 15040
I use _mm256_cvtps_epi32() to convert 8 floats to 8 32-bit integers, but the goal is to end up with 16-bit unsigned integers. I have 2 vectors, a0 and a1, each of type __m256i. What is the fastest way to pack them so that the 16-bit equivalents of a0 land in the lower 128 bits of the result and the 16-bit equivalents of a1 land in the upper 128 bits?
Here's what I've got so far, where p0 and p1 are two __m256 vectors of 8 floats each:
const __m256i vShuffle = _mm256_setr_epi8(
0, 1, 4, 5, 8, 9, 12, 13, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 4, 5, 8, 9, 12, 13);
const __m256i a0 = _mm256_cvtps_epi32(p0);
const __m256i a1 = _mm256_cvtps_epi32(p1);
const __m256i b0 = _mm256_shuffle_epi8(a0, vShuffle);
const __m256i b1 = _mm256_shuffle_epi8(a1, vShuffle);
const __m128i c0 = _mm_or_si128(_mm256_extracti128_si256(b0, 0), _mm256_extracti128_si256(b0, 1));
const __m128i c1 = _mm_or_si128(_mm256_extracti128_si256(b1, 0), _mm256_extracti128_si256(b1, 1));
return _mm256_setr_m128i(c0, c1);
Upvotes: 2
Views: 1783
Reputation: 1344
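I haven't tested this code, but it should do the trick. _mm256_packus_epi32 packs within each 128-bit lane (with unsigned saturation), so the 64-bit permute afterwards puts the halves back in order: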
__m256i tmp1 = _mm256_cvtps_epi32(p0);
__m256i tmp2 = _mm256_cvtps_epi32(p1);
tmp1 = _mm256_packus_epi32(tmp1, tmp2);
tmp1 = _mm256_permute4x64_epi64(tmp1, 0xD8);
// tmp1 now holds the packed result; store it with _mm256_store_si256
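To see why the permute with 0xD8 is needed, here is a rough scalar sketch of what the two intrinsics do to the element order. It is only a model for illustration, not the intrinsics themselves: the helper names (saturate_u16, model_packus_epi32, model_permute4x64, model_pack_two) are made up for this example, and the vectors are represented as plain arrays. Note that the pack saturates out-of-range values to 0..65535, whereas the byte shuffle in the question keeps the low 16 bits of each element, so the two approaches only agree when every value already fits in 16 bits.

```c
#include <stdint.h>

/* Unsigned saturation from int32 to uint16, as performed per element by packus. */
static uint16_t saturate_u16(int32_t v)
{
    if (v < 0) return 0;
    if (v > 65535) return 65535;
    return (uint16_t)v;
}

/* Scalar model of _mm256_packus_epi32(a, b) (hypothetical helper).
   The pack works on each 128-bit lane separately, so the output order is
   a[0..3], b[0..3], a[4..7], b[4..7]. */
static void model_packus_epi32(const int32_t a[8], const int32_t b[8],
                               uint16_t out[16])
{
    for (int lane = 0; lane < 2; ++lane) {
        for (int i = 0; i < 4; ++i) {
            out[lane * 8 + i]     = saturate_u16(a[lane * 4 + i]);
            out[lane * 8 + 4 + i] = saturate_u16(b[lane * 4 + i]);
        }
    }
}

/* Scalar model of _mm256_permute4x64_epi64(v, imm) (hypothetical helper).
   Each 64-bit quarter q of the result is taken from source quarter
   (imm >> 2*q) & 3; one quarter holds four 16-bit elements. */
static void model_permute4x64(const uint16_t in[16], int imm, uint16_t out[16])
{
    for (int q = 0; q < 4; ++q) {
        int src = (imm >> (2 * q)) & 3;
        for (int i = 0; i < 4; ++i)
            out[q * 4 + i] = in[src * 4 + i];
    }
}

/* Pack followed by permute with 0xD8 (quarters 0, 2, 1, 3):
   the result is a0[0..7] in the low half and a1[0..7] in the high half. */
static void model_pack_two(const int32_t a0[8], const int32_t a1[8],
                           uint16_t out[16])
{
    uint16_t packed[16];
    model_packus_epi32(a0, a1, packed);
    model_permute4x64(packed, 0xD8, out);
}
```

Without the permute, the middle two 64-bit quarters would be swapped (the second half of a0 would sit after the first half of a1), which is exactly what 0xD8 undoes.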
Upvotes: 3