Cyan

Reputation: 13968

Swap halves of a NEON vector with C/gcc intrinsics: no intrinsic for VSWP?

I'm trying to do something relatively simple with NEON vector instructions: given a uint64x2_t, I want to swap the positions of its two 64-bit lanes.

That is, if this were plain scalar code:

#include <stdint.h>

typedef uint64_t U64;

typedef struct {
    U64 u[2];
} u64x2;

u64x2 swap(u64x2 in)
{
    u64x2 out;
    out.u[0] = in.u[1];
    out.u[1] = in.u[0];
    return out;
}

Surprisingly enough, I can't find an intrinsic for that. There is apparently an assembler instruction for it (VSWP) but no corresponding intrinsic.

This is weird. It's about as trivial an operation as it gets, so it must be possible. The question is: how?

Edit: for reference, the Godbolt output using @Jake's answer: https://godbolt.org/z/ueJ6nB . No vswp, but vext works great.

Upvotes: 3

Views: 993

Answers (2)

Peter Cordes

Reputation: 365576

Another way to express this shuffle is with GNU C native vector builtins, which give target-independent ways to express a given operation. Shuffles with compile-time-constant masks can be optimized into immediate shuffles, according to what the target supports; runtime-variable shuffles, though, can be inefficient depending on target ISA support (see the sketch after the next snippet).

#include <arm_neon.h>

#ifndef __clang__   /* clang doesn't implement __builtin_shuffle */
uint64x2_t swap_GNU_shuffle(uint64x2_t in)
{
    uint64x2_t mask = {1,0};   /* compile-time-constant lane indices */
    uint64x2_t out = __builtin_shuffle (in, mask);
    return out;
}
#endif
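
To illustrate the runtime-variable case mentioned above, here is a minimal sketch (GCC-only, like the snippet above; swap_var_shuffle is just an illustrative name). When the mask is a function parameter instead of a constant, the compiler can't select an immediate shuffle and has to fall back on whatever variable-permute support the target offers:

#ifndef __clang__
/* Runtime-variable mask: no immediate shuffle possible, so codegen
   quality depends entirely on the target's variable-permute support. */
uint64x2_t swap_var_shuffle(uint64x2_t in, uint64x2_t mask)
{
    return __builtin_shuffle(in, mask);
}
#endif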

AArch64 gcc8.2 on Godbolt does in fact compile this to the same shuffle Jake suggested, rather than any swap instruction:

swap_GNU_shuffle:
        ext     v0.16b, v0.16b, v0.16b, #8
        ret

Clang also optimizes most of our pure-C attempts into an ext instruction, including one that uses memcpy to type-pun to your plain struct and back; GCC's shuffle optimizer isn't as thorough. (On Godbolt, use any clang from the dropdown with -O3 -target arm64; unlike GCC, clang is normally built with support for multiple target ISAs by default.)
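
For reference, a sketch of what the clang-side equivalents can look like (the function names here are just illustrative): clang spells its shuffle builtin __builtin_shufflevector, taking the lane indices as trailing constant arguments, and the memcpy version round-trips through the plain struct from the question:

#include <stdint.h>
#include <string.h>
#include <arm_neon.h>

#ifdef __clang__
uint64x2_t swap_clang_shuffle(uint64x2_t in)
{
    /* Lane indices are passed directly as constant arguments. */
    return __builtin_shufflevector(in, in, 1, 0);
}
#endif

/* memcpy type-pun through the scalar struct; at -O3 clang sees
   through the copies and emits a single ext. */
typedef struct { uint64_t u[2]; } u64x2_scalar;

uint64x2_t swap_memcpy(uint64x2_t in)
{
    u64x2_scalar tmp;
    memcpy(&tmp, &in, sizeof(tmp));
    u64x2_scalar swapped = { { tmp.u[1], tmp.u[0] } };
    uint64x2_t out;
    memcpy(&out, &swapped, sizeof(out));
    return out;
}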

So either all of these compilers have missed optimizations for tune=generic and -mcpu=cortex-a53, a57, and a75, or ext really is a good choice here, perhaps better than vswp, which has to write two output registers instead of logically writing one full-width register. That's usually not a problem on ARM, though; quite a few instructions do it, and it's generally handled efficiently.

ARM's timing info for Cortex-A8 shows the same numbers for vext and vswp (both have 1-cycle latency from Qn to the output, but 2 cycles from Qm). I haven't checked newer cores (or any 64-bit cores).

Upvotes: 5

You are right, the NEON intrinsics don't cover the VSWP instruction.

However, you can resort to the VEXT instruction instead, which is also available as an intrinsic.

out = vextq_u64(in, in, 1);
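
Wrapped into a self-contained test, it looks like this (a minimal sketch; swap_u64x2 and the little main are just for illustration):

#include <inttypes.h>
#include <stdio.h>
#include <arm_neon.h>

static uint64x2_t swap_u64x2(uint64x2_t in)
{
    /* vext extracts 2 lanes from the concatenation {in : in},
       starting at lane 1: result = { in[1], in[0] }. */
    return vextq_u64(in, in, 1);
}

int main(void)
{
    uint64_t buf[2] = { 1, 2 };
    uint64x2_t v = vld1q_u64(buf);
    vst1q_u64(buf, swap_u64x2(v));
    printf("%" PRIu64 " %" PRIu64 "\n", buf[0], buf[1]);  /* expect: 2 1 */
    return 0;
}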


Alternatively, you can make use of vcombine (and pray that the compiler doesn't mess it up):

out = vcombine_u64(vget_high_u64(in), vget_low_u64(in));

But beware: compilers tend to generate FUBAR machine code when they see vcombine and/or vget.

Stick with the former; that's my advice.

Upvotes: 5
