Reputation:
I need to convert big arrays of 16-bit integer values from big-endian to little-endian format.
Currently I use the following function for the conversion:
inline void Reorder16bit(const uint8_t * src, uint8_t * dst)
{
    uint16_t value = *(uint16_t*)src;
    *(uint16_t*)dst = value >> 8 | value << 8;
}

void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
    assert(size % 2 == 0);
    for (size_t i = 0; i < size; i += 2)
        Reorder16bit(src + i, dst + i);
}
I use GCC. Target platform is ARMv7 (Raspberry Pi 2B).
Is there any way to optimize it?
This conversion is needed for loading audio samples, which can be in either little-endian or big-endian format. Of course it is not a bottleneck now, but it takes about 10% of total processing time, and I think that is too much for such a simple operation.
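For what it's worth, a minimal harness along these lines can compare candidate implementations (a sketch; buffer size and iteration count are arbitrary):

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <vector>

void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst); // the function above

int main()
{
    const size_t size = 1 << 20; // 1 MB of sample data
    const int iterations = 100;
    std::vector<uint8_t> src(size), dst(size);
    for (size_t i = 0; i < size; ++i)
        src[i] = uint8_t(i);

    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; ++i)
        Reorder16bit(&src[0], size, &dst[0]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    std::printf("%.1f MB/s\n", double(size) * iterations / (seconds * 1e6));
    return 0;
}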
Upvotes: 7
Views: 7425
Reputation: 28087
int swap(int b) {
    return __builtin_bswap16(b);
}
becomes
swap(int):
rev16 r0, r0
uxth r0, r0
bx lr
So yours could be written as (gcc-explorer: https://goo.gl/HFLdMb)
void fast_Reorder16bit(const uint16_t * src, size_t size, uint16_t * dst)
{
    // size is the number of 16-bit elements here, not bytes
    for (size_t i = 0; i < size; i++)
        dst[i] = __builtin_bswap16(src[i]);
}
which should turn the inner loop into
.L13:
ldrh r4, [r0, r3]
rev16 r4, r4
strh r4, [r2, r3] @ movhi
adds r3, r3, #2
cmp r3, r1
bne .L13
Read more about __builtin_bswap16 in the GCC builtin docs.
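If portability beyond GCC/Clang matters, the builtin can be wrapped behind a guard; a sketch (the fallback is the same shift/or idiom from the question, which GCC recognizes anyway):

#include <cstdint>

inline uint16_t bswap16(uint16_t v)
{
#if defined(__GNUC__) // GCC and Clang provide the builtin
    return __builtin_bswap16(v);
#else // portable fallback: same shift/or idiom as in the question
    return uint16_t((v >> 8) | (v << 8));
#endif
}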
Neon suggestion (kinda tested, gcc-explorer: https://goo.gl/fLNYuc):
void neon_Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
    assert(size % 16 == 0);
    // uint8x16_t vld1q_u8(const uint8_t *)
    // uint8x16_t vrev16q_u8(uint8x16_t vec)
    // void vst1q_u8(uint8_t *, uint8x16_t)
    for (size_t i = 0; i < size; i += 16)
        vst1q_u8(dst + i, vrev16q_u8(vld1q_u8(src + i)));
}
which becomes
.L23:
adds r5, r0, r3
adds r4, r2, r3
adds r3, r3, #16
vld1.8 {d16-d17}, [r5]
cmp r1, r3
vrev16.8 q8, q8
vst1.8 {d16-d17}, [r4]
bhi .L23
See more about neon intrinsics here: https://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc/ARM-NEON-Intrinsics.html
Bonus from ARM ARM A8.8.386:
VREV16 (Vector Reverse in halfwords) reverses the order of 8-bit elements in each halfword of the vector, and places the result in the corresponding destination vector.
VREV32 (Vector Reverse in words) reverses the order of 8-bit or 16-bit elements in each word of the vector, and places the result in the corresponding destination vector.
VREV64 (Vector Reverse in doublewords) reverses the order of 8-bit, 16-bit, or 32-bit elements in each doubleword of the vector, and places the result in the corresponding destination vector.
There is no distinction between data types, other than size.
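All three map directly to intrinsics; a sketch of them on q registers using the u8 element type (function names are mine, and per the note above any element type of the right size works):

#include <arm_neon.h>

uint8x16_t rev_halfwords(uint8x16_t v)   { return vrev16q_u8(v); } // VREV16
uint8x16_t rev_words(uint8x16_t v)       { return vrev32q_u8(v); } // VREV32
uint8x16_t rev_doublewords(uint8x16_t v) { return vrev64q_u8(v); } // VREV64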
Upvotes: 8
Reputation: 9416
I don't know much about ARM instruction sets, but I guess there are special instructions for endianness conversion. Apparently, ARMv7 has things like rev etc.
Have you tried the compiler intrinsic __builtin_bswap16? It should compile to CPU-specific code, e.g. rev on ARM. In addition, it helps the compiler recognize that you are actually doing a byte swap, so it can perform other optimizations with that knowledge, e.g. eliminate redundant byte swaps entirely in cases like y = swap(x); y &= some_value; x = swap(y);.
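For example, that pattern made concrete (a sketch; the mask value is arbitrary):

#include <cstdint>

// y = swap(x); y &= some_value; x = swap(y); from above, made concrete.
uint16_t mask_be(uint16_t x)
{
    uint16_t y = __builtin_bswap16(x); // big-endian -> native
    y &= 0x0FFF;                       // work in native byte order
    return __builtin_bswap16(y);       // native -> big-endian
}

Since AND operates bytewise, the two swaps can cancel out, and GCC can emit a single AND with the byte-swapped constant (0xFF0F here).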
I googled a little bit, and this thread discusses an issue with the optimization potential. According to this discussion, the compiler can also vectorize the conversion if the CPU supports the vrev NEON instruction.
Upvotes: 6
Reputation: 3524
If it's specifically for ARM, there's a REV instruction, specifically REV16, which byte-swaps two 16-bit ints at a time.
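If you want that instruction without going through the builtin, GCC inline assembly reaches it directly; a sketch, valid only when compiling for ARM:

#include <cstdint>

// REV16 byte-swaps both halfwords of a 32-bit register at once.
inline uint32_t rev16(uint32_t x)
{
    uint32_t r;
    __asm__("rev16 %0, %1" : "=r"(r) : "r"(x));
    return r;
}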
Upvotes: 6
Reputation: 4038
If you want to improve the performance of your code, you can do the following:
1) Process 4 bytes per step:
inline void Reorder16bit(const uint8_t * src, uint8_t * dst)
{
    uint16_t value = *(uint16_t*)src;
    *(uint16_t*)dst = value >> 8 | value << 8;
}

inline void Reorder16bit2(const uint8_t * src, uint8_t * dst)
{
    uint32_t value = *(uint32_t*)src;
    *(uint32_t*)dst = (value & 0xFF00FF00) >> 8 | (value & 0x00FF00FF) << 8;
}

void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
    assert(size % 2 == 0);
    size_t alignedSize = size/4*4;
    for (size_t i = 0; i < alignedSize; i += 4)
        Reorder16bit2(src + i, dst + i);
    for (size_t i = alignedSize; i < size; i += 2)
        Reorder16bit(src + i, dst + i);
}
If you use a 64-bit platform, it is possible to process 8 bytes in one step in the same way.
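For example, the 8-byte step could look like this (a sketch; the name follows the Reorder16bit2 pattern above, and the same casting caveats apply):

inline void Reorder16bit4(const uint8_t * src, uint8_t * dst)
{
    uint64_t value = *(uint64_t*)src;
    *(uint64_t*)dst = (value & 0xFF00FF00FF00FF00ULL) >> 8 |
                      (value & 0x00FF00FF00FF00FFULL) << 8;
}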
2) The ARMv7 platform supports SIMD instructions called NEON. Using them, you can make your code even faster than in 1):
#include <arm_neon.h>

inline void Reorder16bit(const uint8_t * src, uint8_t * dst)
{
    uint16_t value = *(uint16_t*)src;
    *(uint16_t*)dst = value >> 8 | value << 8;
}

inline void Reorder16bit8(const uint8_t * src, uint8_t * dst)
{
    uint8x16_t _src = vld1q_u8(src);
    vst1q_u8(dst, vrev16q_u8(_src));
}

void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
    assert(size % 2 == 0);
    size_t alignedSize = size/16*16;
    for (size_t i = 0; i < alignedSize; i += 16)
        Reorder16bit8(src + i, dst + i);
    for (size_t i = alignedSize; i < size; i += 2)
        Reorder16bit(src + i, dst + i);
}
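Note that the intrinsics need NEON enabled at compile time; something like this should work on the Pi 2 (assuming a hard-float GCC toolchain):

g++ -O3 -mfpu=neon -mfloat-abi=hard reorder.cpp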
Upvotes: 8
Reputation: 32727
You'd want to measure to see which is faster, but an alternative body for Reorder16bit would be
*(uint16_t*)dst = 256 * src[0] + src[1];
assuming that your native ints are little-endian. Another possibility:
dst[0] = src[1];
dst[1] = src[0];
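For completeness, the whole function built on the byte copy might look like this (a sketch); plain byte accesses sidestep the alignment and strict-aliasing questions that the pointer casts raise, and the temporary makes it safe for in-place use (src == dst):

#include <cassert>
#include <cstddef>
#include <cstdint>

void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
    assert(size % 2 == 0);
    for (size_t i = 0; i < size; i += 2)
    {
        uint8_t tmp = src[i]; // temporary allows src == dst
        dst[i] = src[i + 1];
        dst[i + 1] = tmp;
    }
}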
Upvotes: 2