Reputation: 15009
According to the documentation, the AVX-512 instruction set is supported from gcc 4.9 onward, but I have gcc 4.8. I currently have code like this for summing up a block of memory (it's guaranteed to be less than 256 bytes, so no overflow worries):
__m128i sum = _mm_add_epi16(sum, _mm_cvtepu8_epi16(*(__m128i *) &mem));
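For context, the surrounding loop is roughly the following (a simplified sketch, with mem and len standing in for the real buffer and length, and the reduction at the end done the obvious way):

#include <smmintrin.h>  /* SSE4.1 for _mm_cvtepu8_epi16 */
#include <stddef.h>
#include <stdint.h>

/* Sum all bytes of a block shorter than 256 bytes, so the eight
   16-bit partial sums in the vector accumulator cannot overflow. */
static unsigned sum_block(const uint8_t *mem, size_t len)
{
    __m128i sum = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        /* load 8 bytes and widen them to eight 16-bit lanes */
        __m128i v = _mm_loadl_epi64((const __m128i *)(mem + i));
        sum = _mm_add_epi16(sum, _mm_cvtepu8_epi16(v));
    }
    /* reduce the eight partial sums, then handle the leftover bytes
       (the part the rest of this question is about) with a scalar loop */
    uint16_t lanes[8];
    _mm_storeu_si128((__m128i *)lanes, sum);
    unsigned total = 0;
    for (int k = 0; k < 8; k++)
        total += lanes[k];
    for (; i < len; i++)
        total += mem[i];
    return total;
}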
Now, looking through the documentation, if we have, say, four bytes left over, I could use:
__m128i sum = _mm_add_epi16(sum,
                            _mm_mask_cvtepu8_epi16(_mm_set1_epi16(0),
                                                   (__mmask8)_mm_set_epi16(0,0,0,0,1,1,1,1),
                                                   *(__m128i *) &mem));
(Note, the type of __mmask8 doesn't seem to be documented anywhere I can find, so I am guessing...)
However, _mm_mask_cvtepu8_epi16 is an AVX-512 instruction, so is there a way to duplicate this? I tried:
_mm_mullo_epi16(_mm_set_epi16(0,0,0,0,1,1,1,1),
                _mm_cvtepu8_epi16(*(__m128i *) &mem));
However, there was a cache stall, so a direct for (int i = 0; i < remaining_bytes; i++) sum += mem[i]; gave better performance.
Upvotes: 1
Views: 1055
Reputation: 789
As I happened to stumble across this question, and it still hasn't gotten an answer, if this is still a problem...
For your example problem, you're on the right track.
Avoid _mm_mullo_epi16, though. Use _mm_and_si128 instead, as bitwise AND is a much faster operation, e.g. _mm_and_si128(_mm_cvtepu8_epi16(*(__m128i *) &mem), _mm_set_epi32(0, 0, -1, -1)).
Another option is _mm_srli_si128(vector, 8), which doesn't need any additional registers/memory loads, although a shift may be slower than an AND. _mm_move_epi64, which keeps the low 64 bits and zeroes the upper half, works as well. Those tricks only handle a fixed remainder, though (here four bytes; in general you may need n%16 bytes for some arbitrary n). For that case, AND with an entry from a table of precomputed masks: _mm_and_si128(vector, masks[n & 0xf]).
(Note that _mm_mask_cvtepu8_epi16 only cares about the low half of the vector, so your example is somewhat confusing - that is, you don't need to mask anything because the later elements are completely ignored anyway.)
On a more generic level, mask operations are really just an embedded _mm_blend_epi16 (or equivalent). For zeroing idioms, they can easily be emulated with _mm_and_si128 / _mm_andnot_si128, as shown above.
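To make both idioms concrete, here is a rough sketch; tail_mask, widen_tail and masked_merge are names I've made up for illustration, and it assumes it is safe to read a full 8 bytes at the tail (e.g. the buffer is padded):

#include <smmintrin.h>  /* SSE4.1 for _mm_cvtepu8_epi16 */
#include <stddef.h>
#include <stdint.h>

/* tail_mask[n] keeps the low n 16-bit lanes and zeroes the rest (n = 0..8);
   this is the masks[n & 0xf] idea specialised to the widened 16-bit vector. */
static const uint16_t tail_mask[9][8] = {
    {0,0,0,0,0,0,0,0},
    {0xFFFF,0,0,0,0,0,0,0},
    {0xFFFF,0xFFFF,0,0,0,0,0,0},
    {0xFFFF,0xFFFF,0xFFFF,0,0,0,0,0},
    {0xFFFF,0xFFFF,0xFFFF,0xFFFF,0,0,0,0},
    {0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0,0,0},
    {0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0,0},
    {0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0},
    {0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF},
};

/* Zeroing flavour: widen up to 8 remaining bytes and clear the unwanted
   lanes with a single AND. */
static __m128i widen_tail(const uint8_t *mem, size_t n /* 0..8 */)
{
    __m128i v = _mm_loadl_epi64((const __m128i *)mem);  /* reads 8 bytes */
    __m128i widened = _mm_cvtepu8_epi16(v);
    return _mm_and_si128(widened,
                         _mm_loadu_si128((const __m128i *)tail_mask[n]));
}

/* Merging flavour (what AVX-512 calls merge masking): equivalent to a blend,
   written with AND/ANDNOT/OR so it also works with a variable mask:
   result = (mask & new) | (~mask & old). */
static __m128i masked_merge(__m128i old_v, __m128i new_v, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, new_v),
                        _mm_andnot_si128(mask, old_v));
}

The table lookup does cost a load, which may be the sort of stall you saw with the multiply version, so for a fixed four-byte remainder the constant-AND or _mm_move_epi64 variants above are probably the cheaper choice.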
Upvotes: 2