Reputation: 89
The API for shuffling only supports byte and sbyte:
//
// Summary:
// __m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
//
// VPSHUFB ymm, ymm, ymm/m256
//
// Parameters:
// value:
//
// mask:
public static Vector256<sbyte> Shuffle(Vector256<sbyte> value, Vector256<sbyte> mask);
//
// Summary:
// __m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
//
// VPSHUFB ymm, ymm, ymm/m256
//
// Parameters:
// value:
//
// mask:
public static Vector256<byte> Shuffle(Vector256<byte> value, Vector256<byte> mask);
How would you shuffle other element types? For example, say I have a Vector256<short> and want to shuffle it with a mask like [0, 1, 7, 7, 3, 3, 2, 0].
Would I have to do it at the byte level instead, i.e. convert the above mask into its byte equivalent?
Upvotes: 0
Views: 99
Reputation: 321
The AVX2 instruction set also supports 32-bit indices through the vpermd instruction (_mm256_permutevar8x32_epi32, Avx2.PermuteVar8x32).
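As a minimal sketch of the 32-bit case, the wrapper below calls Avx2.PermuteVar8x32 on a Vector256<int>. The class name Permute32Demo and its Shuffle method are hypothetical names chosen for illustration; only Avx2.PermuteVar8x32 comes from the .NET API.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, for illustration only.
public static class Permute32Demo
{
    // Full-width shuffle of eight 32-bit elements:
    // result[i] = value[indices[i] & 7] (VPERMD uses the low 3 bits of each index).
    public static Vector256<int> Shuffle(Vector256<int> value, Vector256<int> indices)
    {
        if (!Avx2.IsSupported)
            throw new PlatformNotSupportedException("AVX2 is required.");

        // VPERMD can move elements across the two 128-bit lanes.
        return Avx2.PermuteVar8x32(value, indices);
    }
}
```

With value = [10, 11, 12, 13, 14, 15, 16, 17] and the mask from the question, [0, 1, 7, 7, 3, 3, 2, 0], this returns [10, 11, 17, 17, 13, 13, 12, 10].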
The AVX-512 family of instruction sets supports 16-bit to 64-bit indices through the vpermw (_mm256_permutexvar_epi16, Avx512BW.VL.PermuteVar16x16) and vpermq (_mm256_permutexvar_epi64, Avx512F.VL.PermuteVar4x64) instructions.
When AVX-512 is not available, the indices have to be transformed first: each element index is expanded into the byte indices that cover that element, and vpshufb then performs the 16-bit to 64-bit shuffle. The source code for the 16-bit case is as follows.
// Requires: using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
private static readonly Vector256<ushort> Shuffle_UInt16_Multiplier = Vector256.Create((ushort)0x202);
private static readonly Vector256<byte> Shuffle_UInt16_ByteOffset = Vector256.Create((byte)0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1);
public static Vector256<ushort> Shuffle(Vector256<ushort> vector, Vector256<ushort> indices) {
    Vector256<ushort> mask, raw, rt;
    mask = Avx2.CompareEqual(Avx2.ShiftRightLogical(indices, 4), Vector256<ushort>.Zero); // Unsigned compare: (i < 16)
    raw = YShuffleKernel(vector, indices);
    rt = Avx2.And(raw, mask); // Out-of-range indices produce 0
    return rt;
}
public static Vector256<ushort> YShuffleKernel(Vector256<ushort> vector, Vector256<ushort> indices) {
    // i * 0x202 puts 2*i into both bytes of each 16-bit index; adding (0, 1) yields the byte indices 2*i and 2*i+1.
    Vector256<byte> indices2 = Avx2.Add(Avx2.MultiplyLow(indices, Shuffle_UInt16_Multiplier).AsByte(), Shuffle_UInt16_ByteOffset);
    return YShuffleKernel(vector.AsByte(), indices2).AsUInt16();
}
private static readonly Vector256<byte> Shuffle_Byte_LaneAdd_K0 = Vector256.Create(0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0);
private static readonly Vector256<byte> Shuffle_Byte_LaneAdd_K1 = Vector256.Create(0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70);
// Cross lane
public static Vector256<byte> YShuffleKernel(Vector256<byte> vector, Vector256<byte> indices) {
    // Format: Code; //Latency, Throughput(references IceLake)
    Vector256<byte> vector1 = Avx2.Permute4x64(vector.AsInt64(), (byte)0x4E).AsByte(); // 3,1. _MM_SHUFFLE(1, 0, 3, 2) = (1 << 6) | (0 << 4) | (3 << 2) | 2 = 0x4E = 78
    Vector256<byte> indices0 = Avx2.Add(indices, Shuffle_Byte_LaneAdd_K0); // 1,0.33
    Vector256<byte> indices1 = Avx2.Add(indices, Shuffle_Byte_LaneAdd_K1); // 1,0.33
    Vector256<byte> v0 = Avx2.Shuffle(vector, indices0); // 1,0.5
    Vector256<byte> v1 = Avx2.Shuffle(vector1, indices1); // 1,0.5
    Vector256<byte> rt = Avx2.Or(v0, v1); // 1,0.33
    return rt; //total latency: 8, total throughput CPI: 3
}
Note: Avx2.Shuffle shuffles within each 128-bit lane separately, whereas YShuffleKernel crosses lanes and shuffles the entire vector.
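To make the index transformation concrete, here is a scalar sketch of what the multiply-and-add step does to each 16-bit index. The class and method names (WordIndexExpansion, ToByteIndices) are hypothetical and are not part of the code above or of any library; the sketch assumes indices below 128 so that 2*i fits in a byte.

```csharp
using System;

// Hypothetical scalar model of the index expansion, for illustration only.
public static class WordIndexExpansion
{
    // Expands 16-bit element indices into the byte indices that select the same elements.
    public static byte[] ToByteIndices(ushort[] wordIndices)
    {
        var bytes = new byte[wordIndices.Length * 2];
        for (int k = 0; k < wordIndices.Length; k++)
        {
            // i * 0x0202 writes 2*i into both bytes of the 16-bit value (valid while i < 128).
            ushort doubled = (ushort)(wordIndices[k] * 0x0202);
            bytes[2 * k] = (byte)(doubled & 0xFF);          // low byte:  2*i + 0
            bytes[2 * k + 1] = (byte)((doubled >> 8) + 1);  // high byte: 2*i + 1
        }
        return bytes;
    }
}
```

For the mask from the question, [0, 1, 7, 7, 3, 3, 2, 0], this yields the byte indices [0, 1, 2, 3, 14, 15, 14, 15, 6, 7, 6, 7, 4, 5, 0, 1].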
For ease of use, I have developed the VectorTraits library, which integrates the algorithms above. Its Shuffle method supports 8-bit to 64-bit integer indices and is hardware-accelerated on these architectures:
x86: _mm256_shuffle_epi8 and other instructions.
Arm: vqvtbl1q_u8 instructions.
WebAssembly: i8x16.swizzle instructions.
NuGet: https://www.nuget.org/packages/VectorTraits (Disclosure: I am the owner of the repo)
Upvotes: 0
Reputation: 64913
Would I have to do it at the byte level instead, i.e. convert the above mask into its byte equivalent?
For a vector of (u)short, usually yes (but it's more complicated), unless you can use AVX512 (for VPERMW) or the indexes are lined up in pairs so that you can shuffle it as a vector of (u)int.
For a vector of (u)int, there is PermuteVar8x32, which is generally more convenient anyway.
By the way, Vector256.Shuffle does have an overload to shuffle a vector of shorts, but in my tests at least it just calls some fallback method, so you probably don't want to rely on that.
In general, shuffling a vector of shorts with AVX2 is more of a puzzle than just reinterpreting it as a vector of bytes, because shuffling a vector of bytes is itself more complicated than calling Avx2.Shuffle, and that is the real issue here. Avx2.Shuffle is part of the solution, but VPSHUFB does not move bytes between the two 128-bit halves of a 256-bit vector. There are various solutions depending on what your indexes look like, but the general idea is to rely mostly on byte shuffles and to handle movement between the two 128-bit halves separately.
For example, you can make a 256-bit vector that has two copies of the lower half of the data, another 256-bit vector that has two copies of the upper half of the data, shuffle each of these, then blend based on whether you want a byte from the lower or the upper part. In general you can do any 32 byte shuffle with that, and you can build a word shuffle on top of it.
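The approach just described could be sketched roughly as follows. This is one possible arrangement, not a tuned implementation: the names CrossLaneShuffle, ShuffleBytes and ShuffleWords are hypothetical, and the sketch assumes in-range indexes (0..31 for bytes, 0..15 for words).

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, for illustration only.
public static class CrossLaneShuffle
{
    // result[i] = value[indices[i]], assuming every index is in 0..31.
    public static Vector256<byte> ShuffleBytes(Vector256<byte> value, Vector256<byte> indices)
    {
        if (!Avx2.IsSupported)
            throw new PlatformNotSupportedException("AVX2 is required.");

        // Two copies of the lower 128 bits, and two copies of the upper 128 bits.
        Vector256<byte> lo = Avx2.Permute2x128(value, value, 0x00);
        Vector256<byte> hi = Avx2.Permute2x128(value, value, 0x11);

        // With bit 7 clear, VPSHUFB looks only at the low 4 bits of each index,
        // so each shuffle picks byte (index & 15) of its half.
        Vector256<byte> fromLo = Avx2.Shuffle(lo, indices);
        Vector256<byte> fromHi = Avx2.Shuffle(hi, indices);

        // Move bit 4 of each index (set for 16..31) into bit 7, the bit VPBLENDVB tests.
        Vector256<byte> pickHi = Avx2.ShiftLeftLogical(indices.AsUInt16(), 3).AsByte();
        return Avx2.BlendVariable(fromLo, fromHi, pickHi);
    }

    // Word shuffle built on the byte shuffle: result[i] = value[indices[i]], indexes in 0..15.
    public static Vector256<ushort> ShuffleWords(Vector256<ushort> value, Vector256<ushort> indices)
    {
        // i * 0x0202 puts 2*i in both bytes; adding 0x0100 makes the high byte 2*i + 1.
        Vector256<ushort> doubled = Avx2.MultiplyLow(indices, Vector256.Create((ushort)0x0202));
        Vector256<byte> byteIndices = Avx2.Add(doubled, Vector256.Create((ushort)0x0100)).AsByte();
        return ShuffleBytes(value.AsByte(), byteIndices).AsUInt16();
    }
}
```

The blend mask is derived from the indexes themselves, so no extra comparison is needed. Whether this or the add-and-OR variant is faster likely depends on the target CPU.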
Upvotes: 1