Reputation: 661
I am new to GCC's C vector extensions. I am considering using them in my project, but their utility is (somewhat) contingent on being able to efficiently move all elements in a vector one position to the left and store the result in a new vector. How can I do this efficiently (ideally in a SIMD-accelerated way)?
So, basically:
Background information (you can skip this): The purpose of such a transformation is in dealing with matrices where each row is represented as a vector. Specifically, it would let me treat ShiftedVector as the upper-left diagonal for the row beneath, and compare all values in one SIMD operation. If there is another way to compare a vector with another vector offset by one element, that would solve the problem too. But I'm assuming there isn't, and that the most efficient way to perform this comparison is to move all the elements leftward and compare 1:1.
General stipulations:
It seems the bottleneck here is the lack of general information on the process of using the intrinsics. It seems people are either using assembly (which I am no expert in) or auto-vectorization (which doesn't work well here), so vector types are the most logical choice.
Thanks!
Upvotes: 3
Views: 332
Reputation: 8992
The fastest shift is no shift at all (i.e. no move, no copy):
int Data[16] = {
1, 2, 3, 4, 5, 6, 7, 8,
0, 0, 0, 0, 0, 0, 0, 0,
};
int* Ptr = Data;
// first shift
Ptr++;
// second shift
Ptr++;
// and so on.
If the algorithm allows that (i.e. the number of shifts is limited and known in advance) it's possible to reserve enough space, and make "shifts" just by incrementing a pointer.
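To tie this back to the vector types in the question, here is a minimal sketch of how a pointer "shift" could feed a GCC vector. The names `v8si` (borrowed from the question) and `load_v8si` are just for illustration, and I'm assuming the buffer has enough padding after the real data so that reading 8 ints from the offset pointer stays in bounds. `memcpy` is used so the load doesn't depend on the offset pointer being 32-byte aligned.

```c
#include <string.h>

typedef int v8si __attribute__ ((vector_size (32)));

/* Hypothetical helper: load 8 consecutive ints starting at p.
   memcpy avoids any alignment requirement on p; compilers
   typically turn it into a single unaligned vector load. */
static v8si load_v8si(const int *p)
{
    v8si v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

With the `Data` array above, `load_v8si(Data + 1)` would hold elements 2 through 8 followed by one padding zero, i.e. the original row moved one position to the left without any shuffle.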
Upvotes: 0
Reputation: 661
Crawling around in the depths of the manual, I uncovered this bit of tomfoolery:
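typedef int v8si __attribute__ ((vector_size (32)));
/* Brace lists are only valid as initializers, not in plain assignments. */
v8si OriginalVector = {1, 2, 3, 4, 5, 6, 7, 8};
/* Element i of the result is OriginalVector[masker[i]]. */
v8si masker = {1, 2, 3, 4, 5, 6, 7, 0};
v8si ShiftedVector = __builtin_shuffle(OriginalVector, masker);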
I put a 0 at the end of "masker" for no particular reason (any index 0-7 would work). Each entry in masker is the index of the element to take from the original, so element i of the result is OriginalVector[masker[i]].
But although this is an answer, it may not be the "best" answer, since I imagine there is a better way than creating a new vector, taking up a register with the new vector, assigning positions, taking each element out of place and putting it in another arbitrary place, and saving the result.
Yes, we can cache the masker outside the loop or something instead of creating it every time, but I imagine there's some simple "permute left" instruction somewhere which can just slide it over...
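For what it's worth, here is a rough sketch of how this could be wrapped up together with the comparison from the question. It uses the three-argument form of `__builtin_shuffle`, where mask indices 0-7 select from the first vector and 8-15 from the second, so the last slot can be filled with a known value instead of an arbitrary leftover. The function names `shift_left_one` and `lanes_equal_after_shift` are made up for this example, and filling with zero is my own assumption about what the vacated slot should contain.

```c
typedef int v8si __attribute__ ((vector_size (32)));

/* Move every element one position to the left; the vacated last
   slot is filled with 0 (mask index 8 selects zero[0]). */
static v8si shift_left_one(v8si v)
{
    const v8si zero = {0, 0, 0, 0, 0, 0, 0, 0};
    const v8si mask = {1, 2, 3, 4, 5, 6, 7, 8};
    return __builtin_shuffle(v, zero, mask);
}

/* Compare the shifted upper row against the lower row in one
   vector operation. Each element of the result is -1 where the
   two values are equal and 0 where they differ. */
static v8si lanes_equal_after_shift(v8si upper, v8si lower)
{
    return shift_left_one(upper) == lower;
}
```

Since the mask is a compile-time constant here, the compiler is free to pick whatever permute instruction the target offers, and it may warn about the vector ABI if AVX is not enabled (for example via -mavx2).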
Upvotes: 1