Reputation: 13968
I'm trying to do something relatively simple using NEON vector instructions: given a uint64x2_t, I want to swap the positions of its two 64-bit members. In other words, the equivalent of this plain scalar code:
#include <stdint.h>

typedef uint64_t U64;

typedef struct {
    U64 u[2];
} u64x2;

u64x2 swap(u64x2 in)
{
    u64x2 out;
    out.u[0] = in.u[1];
    out.u[1] = in.u[0];
    return out;
}
Surprisingly enough, I can't find an intrinsic for that. There is apparently an assembler instruction for it (VSWP), but no corresponding intrinsic.
This is weird. It's about as trivial an operation as it can be, so it must be possible. The question is: how?
Edit: for reference, the Godbolt outcome using @Jake's answer: https://godbolt.org/z/ueJ6nB. No vswp, but vext works great.
Upvotes: 3
Views: 993
Reputation: 365576
Another way to express this shuffle is with GNU C native vector builtins that give target-independent ways to do a given operation. Compile-time constant shuffle masks can get optimized into immediate shuffles, according to what the target supports. But runtime-variable shuffles could be inefficient depending on target ISA support.
#include <arm_neon.h>

#ifndef __clang__
uint64x2_t swap_GNU_shuffle(uint64x2_t in)
{
    uint64x2_t mask = {1,0};
    uint64x2_t out = __builtin_shuffle(in, mask);
    return out;
}
#endif
AArch64 gcc 8.2 on Godbolt does actually compile this to the same shuffle Jake suggested, not SWP:
swap_GNU_shuffle:
        ext     v0.16b, v0.16b, v0.16b, #8
        ret
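Clang doesn't implement __builtin_shuffle (hence the #ifndef __clang__ guard above); its counterpart is __builtin_shufflevector, which takes the lane indices as extra constant arguments. A minimal sketch of that version:
#ifdef __clang__
uint64x2_t swap_GNU_shuffle(uint64x2_t in)
{
    // Indices pick lanes from the concatenation {in, in}: lane 1, then lane 0.
    return __builtin_shufflevector(in, in, 1, 0);
}
#endif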
Clang also optimizes most of our pure-C attempts to an ext instruction, including one that uses memcpy to type-pun to your plain struct and back (sketched below). GCC, by contrast, doesn't have as good a shuffle optimizer. (On Godbolt, use any clang from the dropdown with -O3 -target arm64; clang is normally built with support for multiple target ISAs by default, unlike GCC.)
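For reference, a minimal sketch of the kind of memcpy type-punning attempt described here, reusing the u64x2 struct from the question (the function name is just illustrative):
#include <arm_neon.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t u[2]; } u64x2;

uint64x2_t swap_memcpy(uint64x2_t in)
{
    // Type-pun the vector into the plain struct, swap the halves, and pun back.
    // memcpy is the well-defined way to type-pun, and the copies optimize away.
    u64x2 tmp, swapped;
    memcpy(&tmp, &in, sizeof(tmp));
    swapped.u[0] = tmp.u[1];
    swapped.u[1] = tmp.u[0];

    uint64x2_t out;
    memcpy(&out, &swapped, sizeof(out));
    return out;
}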
So either all of these compilers have missed optimizations for tune=generic and -mcpu=cortex-a53, a57, and a75, or else ext is actually a good choice, perhaps better than swp, which has to write 2 output registers instead of logically writing one full-width register. But usually that's not a problem for ARM; quite a few instructions can do that, and they usually make it efficient.
ARM's timing info for Cortex-A8 has the same numbers for vext and vswp (both are 1 cycle latency from Qn to Q output, but 2 cycles from Qm to Q output). I haven't checked newer cores (or any 64-bit cores).
Upvotes: 5
Reputation: 6354
You are right, the NEON intrinsics don't support the VSWP instruction.
However, you can resort to the VEXT instruction instead, which is also available as an intrinsic:
out = vextq_u64(in, in, 1);
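This works because vextq_u64 extracts two consecutive 64-bit lanes from the concatenation of its operands, starting at the given lane; with both operands equal to in and an offset of 1, the result is {in[1], in[0]}. A self-contained version, with a function name of my own choosing:
#include <arm_neon.h>

uint64x2_t swap_vext(uint64x2_t in)
{
    // Take two lanes starting at index 1 of {in : in}: lane 1, then lane 0.
    return vextq_u64(in, in, 1);
}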
Alternatively, you can make use of vcombine (and pray that the compiler doesn't mess it up):
out = vcombine_u64(vget_high_u64(in), vget_low_u64(in));
But beware: compilers tend to generate FUBAR machine code when they see vcombine and/or vget.
Stay with the former, that's my advice.
Upvotes: 5