Reputation: 31
I would like to implement numpy.triu_indices(a, 1) (note that the second argument is 1) in C++ with AVX intrinsics. The code snippet below is the unvectorized version I came up with. Here, a is the length (an int), and first and second are the two output arrays:
void triu_indices(int a, int* first, int* second)
{
    int index = 0;
    for (int i = 0; i < a; i++)
    {
        for (int j = i + 1; j < a; j++)
        {
            first[index] = i;
            second[index] = j;
            index++;
        }
    }
}
As an example output, if I give a=4, then
first = [0,0,0,1,1,2]
second = [1,2,3,2,3,3]
Now, I would like to implement this fully in AVX2 (that is, in a vectorized way). Ultimately, the function will be run over an entire array of ints, which will supply the variable a to the function, and the output arrays first and second will be stored in two parent arrays.
Can you please give me some useful hints (or a code snippet) as to how to vectorize this function with explicit AVX2 intrinsics (that is, not depending on compiler auto-vectorization)? Sorry if this is a noob question, as I have only started learning AVX recently.
Upvotes: 0
Views: 324
Reputation: 366095
First of all, make sure you really need to do this, and actually want arrays of indices as an end result, not as part of keeping track of data in a triangular matrix. AVX2 has gather support, and AVX512 has scatter support, but introducing an array of indices makes SIMD much worse.
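(As an aside, a gather through an index vector looks something like the following; a minimal sketch just to show the shape of the API, where base and indices are hypothetical names for an int array and an array of 8 indices:)

__m256i idx  = _mm256_loadu_si256((const __m256i*)indices);  // load 8 indices
__m256i vals = _mm256_i32gather_epi32(base, idx, 4);         // scale = 4 bytes per int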
For looping over triangular matrices, and for i,j to linear, see algorithm of addressing a triangle matrices memory using assembly. (You might want to pad the indexing so each row starts at a 32-byte aligned boundary. i.e. round up the length of each row to a multiple of 8 float elements, a whole AVX vector. This also makes it easier to loop over a matrix with AVX vectors: you can store garbage into the padding at the end of a row instead of having the last vector of a row include some elements from the start of the next row.)
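(A minimal sketch of that padding, assuming row length a:)

int padded_row = (a + 7) & ~7;   // round up to a whole 8-int AVX vector
// element (i, j) then lives at i * padded_row + j, and every row start
// is 32-byte aligned if the base pointer is.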
For linear -> i,j, the closed-form formula includes sqrt (also a C++ version), so it's possible that array lookups could be useful, but really you should avoid doing this at all. (e.g. if looping over a triangular matrix in packed format, keep track of where you are in i,j as well as the linear index, so you don't need a lookup when you find an element you're looking for.)
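(For completeness, one known closed form for the strict upper triangle of an n x n matrix; a hedged sketch adapted from the linked Q&A, with my own function name, worth double-checking before use:)

#include <cmath>

// k is the 0-based linear index into the packed strict upper triangle
void linear_to_ij(long long k, long long n, long long* i, long long* j)
{
    *i = n - 2 - (long long)std::floor(std::sqrt((double)(-8*k + 4*n*(n-1) - 7)) / 2.0 - 0.5);
    *j = k + *i + 1 - n*(n-1)/2 + (n - *i) * (n - *i - 1) / 2;
}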
For large arrays, it breaks down into whole vectors pretty nicely, getting tricky only at the ends of rows.
You might use a pre-defined vector constant for the special case of the last corner, where you have multiple triangle rows within the same vector of 4 or 8 int elements, as sketched below.
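(A hedged sketch of that corner case for a=4, where all 6 remaining elements fit in one 256-bit vector; the constants and the masked store are my own illustration:)

__m256i vfirst_tail  = _mm256_setr_epi32(0,0,0,1,1,2, 0,0);  // last 2 lanes unused
__m256i vsecond_tail = _mm256_setr_epi32(1,2,3,2,3,3, 0,0);
__m256i mask = _mm256_setr_epi32(-1,-1,-1,-1,-1,-1, 0,0);    // store only 6 lanes
_mm256_maskstore_epi32(first,  mask, vfirst_tail);
_mm256_maskstore_epi32(second, mask, vsecond_tail);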
first = [0,0,0,1,1,2]
With a larger triangle, we're storing large runs of the same number (like memset), then a slightly-shorter run of the next number, etc. i.e. storing a whole row of 0s is easy. For all but the last couple of rows, these runs are larger than 1 vector element.
second = [1,2,3,2,3,3]
Again, within a single row it's a simple pattern to vectorize. To store an increasing sequence, start with a vector of {1,2,3,4} and increment it with a SIMD add of {4,4,4,4}, i.e. _mm_set1_epi32(4). For 256-bit AVX2 vectors, use _mm256_set1_epi32(8) to increment an 8-element vector by 8.
So within the inner-most loop, you're just storing one invariant vector, and using _mm256_add_epi32 on another and storing it to the other array.
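(Putting those two ideas together, a minimal sketch with plain scalar cleanup at the end of each row rather than anything clever; the function and variable names are my own:)

#include <immintrin.h>

void triu_indices_avx2(int a, int* first, int* second)
{
    int index = 0;
    const __m256i vstep = _mm256_set1_epi32(8);      // advance the j-sequence by 8
    for (int i = 0; i < a; i++) {
        __m256i vfirst  = _mm256_set1_epi32(i);      // invariant: a whole row of i
        __m256i vsecond = _mm256_add_epi32(          // j = i+1, i+2, ..., i+8
            _mm256_set1_epi32(i + 1),
            _mm256_setr_epi32(0,1,2,3,4,5,6,7));
        int row_len = a - (i + 1);
        int j = 0;
        for (; j + 8 <= row_len; j += 8) {           // whole vectors only
            _mm256_storeu_si256((__m256i*)&first[index + j],  vfirst);
            _mm256_storeu_si256((__m256i*)&second[index + j], vsecond);
            vsecond = _mm256_add_epi32(vsecond, vstep);
        }
        for (; j < row_len; j++) {                   // scalar end-of-row cleanup
            first[index + j]  = i;
            second[index + j] = i + 1 + j;
        }
        index += row_len;
    }
}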
Compilers can already auto-vectorize your function pretty decently, although the end-of-row handling is a lot more complex than you could do by hand. With your code on the Godbolt compiler explorer (with __restrict to tell the compiler the output arrays don't overlap, and __builtin_assume_aligned to tell the compilers that they're aligned), we get an inner loop like this (from gcc):
.L4:                                      # do {
    movups  XMMWORD PTR [rcx+rax], xmm0   # _mm_storeu_si128(&second[index], counter_vec)
    paddd   xmm0, xmm2                    # _mm_add_epi32: counter_vec += {4,4,4,4}
    movups  XMMWORD PTR [r10+rax], xmm1   # _mm_storeu_si128(&first[index], invariant_i_vec)
    add     rax, 16                       # index += 4 (16 bytes)
    cmp     rax, r9
    jne     .L4                           # } while (index != end_row)
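(For reference, the annotated source fed to the compiler might look like this; a sketch, since the exact Godbolt input isn't reproduced here:)

void triu_indices(int a, int* __restrict first, int* __restrict second)
{
    first  = (int*)__builtin_assume_aligned(first, 32);
    second = (int*)__builtin_assume_aligned(second, 32);
    int index = 0;
    for (int i = 0; i < a; i++)
        for (int j = i + 1; j < a; j++) {
            first[index] = i;
            second[index] = j;
            index++;
        }
}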
If I have time, I might write this up in more detail, including better handling of the ends of rows. e.g. a partially-overlapping store that ends exactly at the end of the row is often good (see the sketch below). Or unroll the outer loop so the inner loops have a repeating pattern of cleanup.
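(A hedged sketch of that overlapping final store for one row, reusing the names from the sketch above and assuming row_len >= 8 so backing up stays within the row:)

int tail = row_len - 8;               // back up so the vector ends at the row end
__m256i vtail = _mm256_add_epi32(
    _mm256_set1_epi32(i + 1 + tail),
    _mm256_setr_epi32(0,1,2,3,4,5,6,7));
// these rewrite a few already-stored elements harmlessly
_mm256_storeu_si256((__m256i*)&second[index + tail], vtail);
_mm256_storeu_si256((__m256i*)&first[index + tail], _mm256_set1_epi32(i));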
Calculating the starting vectors for the next outer-loop iteration could be done with something like:
vsecond = _mm256_add_epi32(vsecond, _mm256_set1_epi32(1));
vfirst = _mm256_add_epi32(vfirst, _mm256_set1_epi32(1));
i.e. turn {0,0,0,0,...} into {1,1,1,1,...} by adding a vector of all ones, and turn {1,2,3,4,5,6,7,8} into {2,3,4,5,6,7,8,9} the same way.
Upvotes: 3