Ken Y-N

Reputation: 15009

Simulating AVX-512 mask instructions

According to the documentation, the AVX-512 instruction set is supported from gcc 4.9 onwards, but I have gcc 4.8. I currently have code like this for summing up a block of memory (it's guaranteed to be less than 256 bytes, so there are no overflow worries):

sum = _mm_add_epi16(sum, _mm_cvtepu8_epi16(*(__m128i *) &mem));

Now, looking through the documentation, if we have, say, four bytes left over, I could use:

sum = _mm_add_epi16(sum,
                    _mm_mask_cvtepu8_epi16(_mm_set1_epi16(0),
                                           (__mmask8)_mm_set_epi16(0,0,0,0,1,1,1,1),
                                           *(__m128i *) &mem));

(Note: the __mmask8 type doesn't seem to be documented anywhere I can find, so I am guessing how to construct one...)

However, _mm_mask_cvtepu8_epi16 is an AVX-512 instruction, so is there a way to emulate it without AVX-512? I tried:

_mm_mullo_epi16(_mm_set_epi16(0,0,0,0,1,1,1,1),
                _mm_cvtepu8_epi16(*(__m128i *) &mem));

However, this caused a cache stall, so a plain for (int i = 0; i < remaining_bytes; i++) sum += mem[i]; loop gave better performance.
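For context, this is roughly the shape of the loop in question - a minimal sketch assuming a buffer of fewer than 256 bytes and a hypothetical sum_bytes wrapper, not the exact original code:

#include <smmintrin.h>  /* SSE4.1: _mm_cvtepu8_epi16 */
#include <stdint.h>
#include <stddef.h>

/* Sum up to 255 bytes: widen 8 bytes at a time to 16-bit lanes,
   then handle the remainder with a scalar loop. */
static unsigned sum_bytes(const uint8_t *mem, size_t len)
{
    __m128i sum = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        __m128i v = _mm_loadl_epi64((const __m128i *)(mem + i));
        sum = _mm_add_epi16(sum, _mm_cvtepu8_epi16(v));
    }
    /* horizontal sum of the eight 16-bit lanes (the total fits in 16 bits
       because 255 bytes * 255 max value = 65025) */
    sum = _mm_add_epi16(sum, _mm_srli_si128(sum, 8));
    sum = _mm_add_epi16(sum, _mm_srli_si128(sum, 4));
    sum = _mm_add_epi16(sum, _mm_srli_si128(sum, 2));
    unsigned total = (unsigned)_mm_extract_epi16(sum, 0);
    /* remaining bytes handled with the plain scalar loop */
    for (; i < len; i++)
        total += mem[i];
    return total;
}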

Upvotes: 1

Views: 1055

Answers (1)

zinga

Reputation: 789

Since I happened to stumble across this question and it still hasn't gotten an answer, here are some suggestions in case this is still a problem.

For your example problem, you're on the right track.

  • Multiply is a relatively slow operation, so avoid _mm_mullo_epi16; use _mm_and_si128 instead, as bitwise AND is a much faster operation, e.g. _mm_and_si128(_mm_cvtepu8_epi16(*(__m128i *) &mem), _mm_set_epi32(0, 0, -1, -1))
  • I'm not sure what you mean by a cache stall, but if memory access is a bottleneck and the compiler won't keep the constant for the above in a register, you could use a pair of byte shifts such as _mm_srli_si128(_mm_slli_si128(vector, 8), 8), which needs no additional registers or memory loads. The shifts may be slower than an AND, though.
  • If it's always 8 bytes, you can use _mm_move_epi64
  • None of this solves the case where the remaining number of elements isn't fixed (e.g. you have n%16 bytes for some arbitrary n). Note that AVX-512 doesn't really solve it either. If you need to deal with this case, you could keep a table of masks and AND with the one matching what's remaining, e.g. _mm_and_si128(vector, masks[n & 0xf]) - see the sketch after this list.
  • (_mm_mask_cvtepu8_epi16 only cares about the low half of the vector, so your example is somewhat confusing - that is, you don't need to mask anything because the later elements are completely ignored anyway)
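Here's a minimal sketch of that mask-table idea, assuming the tail is widened with _mm_cvtepu8_epi16 (so at most 8 bytes matter) and using a hypothetical widen_tail helper; the names and table size are illustrative:

#include <smmintrin.h>  /* SSE4.1 */
#include <stdint.h>

/* masks[k] keeps the low k 16-bit lanes and zeroes the rest */
static const uint16_t masks[9][8] __attribute__((aligned(16))) = {
    { 0 },
    { 0xFFFF },
    { 0xFFFF, 0xFFFF },
    { 0xFFFF, 0xFFFF, 0xFFFF },
    { 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF },
    { 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF },
    { 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF },
    { 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF },
    { 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF },
};

/* Widen the next 8 bytes and zero the lanes past `remaining` (0..8).
   Note: this still reads a full 16 bytes from mem, so make sure the
   load can't run off the end of the allocation. */
static inline __m128i widen_tail(const uint8_t *mem, int remaining)
{
    __m128i widened = _mm_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)mem));
    return _mm_and_si128(widened, _mm_load_si128((const __m128i *)masks[remaining]));
}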

On a more generic level, mask operations are really just an embedded _mm_blend_epi16 (or equivalent). For zeroing idioms, they can easily be emulated with _mm_and_si128 / _mm_andnot_si128, as shown above.
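A rough sketch of that equivalence, with hypothetical helper names: merge-masking (what the _mm_mask_* intrinsics do) maps onto a variable blend, and zero-masking is just an AND. This assumes a vector mask holding 0xFFFF in the lanes to keep:

#include <smmintrin.h>  /* SSE4.1: _mm_blendv_epi8 */

/* Zero-masking: keep lanes where the mask is all-ones, zero the rest */
static inline __m128i zero_mask_epi16(__m128i result, __m128i mask)
{
    return _mm_and_si128(result, mask);
}

/* Merge-masking: take `result` where the mask is set, `fallback` elsewhere */
static inline __m128i merge_mask_epi16(__m128i fallback, __m128i mask, __m128i result)
{
    return _mm_blendv_epi8(fallback, result, mask);
    /* without SSE4.1's blendv:
       return _mm_or_si128(_mm_and_si128(mask, result),
                           _mm_andnot_si128(mask, fallback));  */
}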

Upvotes: 2
