Reputation:
I need to convert big arrays of 16-bit integer values from big-endian to little-endian format.
Currently I use the following function for the conversion:
inline void Reorder16bit(const uint8_t * src, uint8_t * dst)
{
    uint16_t value = *(uint16_t*)src;
    *(uint16_t*)dst = value >> 8 | value << 8;
}

void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
    assert(size % 2 == 0);
    for (size_t i = 0; i < size; i += 2)
        Reorder16bit(src + i, dst + i);
}
I use GCC. Target platform is ARMv7 (Raspberry Pi 2B).
Is there any way to optimize it?
This conversion is needed for loading audio samples, which can be in either little-endian or big-endian format. Of course it is not a bottleneck now, but it takes about 10% of total processing time, and I think that is too much for such a simple operation.
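For what it's worth, a minimal harness along these lines can compare candidate implementations (a sketch; buffer size and iteration count are arbitrary):

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <vector>

void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst); // the function above

int main()
{
    const size_t size = 1 << 20; // 1 MB of sample data
    const int iterations = 100;
    std::vector<uint8_t> src(size), dst(size);
    for (size_t i = 0; i < size; ++i)
        src[i] = uint8_t(i);

    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; ++i)
        Reorder16bit(&src[0], size, &dst[0]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    std::printf("%.1f MB/s\n", double(size) * iterations / (seconds * 1e6));
    return 0;
}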
Upvotes: 7
Views: 7425
Reputation: 28087
int swap(int b) {
    return __builtin_bswap16(b);
}
becomes
swap(int):
rev16 r0, r0
uxth r0, r0
bx lr
So yours could be written as (gcc-explorer: https://goo.gl/HFLdMb)
void fast_Reorder16bit(const uint16_t * src, size_t size, uint16_t * dst)
{
    // size is the number of 16-bit elements here, not bytes
    for (size_t i = 0; i < size; i++)
        dst[i] = __builtin_bswap16(src[i]);
}
which should turn the inner loop into
.L13:
ldrh r4, [r0, r3]
rev16 r4, r4
strh r4, [r2, r3] @ movhi
adds r3, r3, #2
cmp r3, r1
bne .L13
Read more about __builtin_bswap16 in the GCC builtin docs.
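If portability beyond GCC/Clang matters, the builtin can be wrapped behind a guard; a sketch (the fallback is the same shift/or idiom from the question, which GCC recognizes anyway):

#include <cstdint>

inline uint16_t bswap16(uint16_t v)
{
#if defined(__GNUC__) // GCC and Clang provide the builtin
    return __builtin_bswap16(v);
#else // portable fallback: same shift/or idiom as in the question
    return uint16_t((v >> 8) | (v << 8));
#endif
}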
Neon suggestion (kinda tested, gcc-explorer: https://goo.gl/fLNYuc):
void neon_Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
    assert(size % 16 == 0);
    // uint8x16_t vld1q_u8(const uint8_t *)
    // uint8x16_t vrev16q_u8(uint8x16_t vec)
    // void vst1q_u8(uint8_t *, uint8x16_t)
    for (size_t i = 0; i < size; i += 16)
        vst1q_u8(dst + i, vrev16q_u8(vld1q_u8(src + i)));
}
which becomes
.L23:
adds r5, r0, r3
adds r4, r2, r3
adds r3, r3, #16
vld1.8 {d16-d17}, [r5]
cmp r1, r3
vrev16.8 q8, q8
vst1.8 {d16-d17}, [r4]
bhi .L23
See more about neon intrinsics here: https://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc/ARM-NEON-Intrinsics.html
Bonus from ARM ARM A8.8.386:
VREV16 (Vector Reverse in halfwords) reverses the order of 8-bit elements in each halfword of the vector, and places the result in the corresponding destination vector.
VREV32 (Vector Reverse in words) reverses the order of 8-bit or 16-bit elements in each word of the vector, and places the result in the corresponding destination vector.
VREV64 (Vector Reverse in doublewords) reverses the order of 8-bit, 16-bit, or 32-bit elements in each doubleword of the vector, and places the result in the corresponding destination vector.
There is no distinction between data types, other than size.
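All three map directly to intrinsics; a sketch of them on q registers using the u8 element type (function names are mine, and per the note above any element type of the right size works):

#include <arm_neon.h>

uint8x16_t rev_halfwords(uint8x16_t v)   { return vrev16q_u8(v); } // VREV16
uint8x16_t rev_words(uint8x16_t v)       { return vrev32q_u8(v); } // VREV32
uint8x16_t rev_doublewords(uint8x16_t v) { return vrev64q_u8(v); } // VREV64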
Upvotes: 8
Reputation: 9416
I don't know much about ARM instruction sets, but I guess there are special instructions for endianness conversion. Apparently, ARMv7 has things like rev etc.
Have you tried the compiler intrinsic __builtin_bswap16? It should compile to CPU-specific code, e.g. rev on ARM. In addition, it helps the compiler recognize that you are actually doing a byte swap, so it can perform other optimizations with that knowledge, e.g. eliminate redundant byte swaps entirely in cases like y = swap(x); y &= some_value; x = swap(y);.
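For example, that pattern made concrete (a sketch; the mask value is arbitrary):

#include <cstdint>

// y = swap(x); y &= some_value; x = swap(y); from above, made concrete.
uint16_t mask_be(uint16_t x)
{
    uint16_t y = __builtin_bswap16(x); // big-endian -> native
    y &= 0x0FFF;                       // work in native byte order
    return __builtin_bswap16(y);       // native -> big-endian
}

Since AND operates bytewise, the two swaps can cancel out, and GCC can emit a single AND with the byte-swapped constant (0xFF0F here).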
I googled a little bit, and this thread discusses an issue with the optimization potential. According to this discussion, the compiler can also vectorize the conversion if the CPU supports the vrev NEON instruction.
Upvotes: 6
Reputation: 3524
If it's specifically for ARM, there's a REV instruction, specifically REV16, which byte-swaps two 16-bit ints at a time.
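If you want that instruction without going through the builtin, GCC inline assembly reaches it directly; a sketch, valid only when compiling for ARM:

#include <cstdint>

// REV16 byte-swaps both halfwords of a 32-bit register at once.
inline uint32_t rev16(uint32_t x)
{
    uint32_t r;
    __asm__("rev16 %0, %1" : "=r"(r) : "r"(x));
    return r;
}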
Upvotes: 6
Reputation: 4038
If you want to improve the performance of your code, you can do the following:
1) Process 4 bytes per step:
inline void Reorder16bit(const uint8_t * src, uint8_t * dst)
{
    uint16_t value = *(uint16_t*)src;
    *(uint16_t*)dst = value >> 8 | value << 8;
}

inline void Reorder16bit2(const uint8_t * src, uint8_t * dst)
{
    uint32_t value = *(uint32_t*)src;
    *(uint32_t*)dst = (value & 0xFF00FF00) >> 8 | (value & 0x00FF00FF) << 8;
}

void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
    assert(size % 2 == 0);
    size_t alignedSize = size/4*4;
    for (size_t i = 0; i < alignedSize; i += 4)
        Reorder16bit2(src + i, dst + i);
    for (size_t i = alignedSize; i < size; i += 2)
        Reorder16bit(src + i, dst + i);
}
If you use a 64-bit platform, it is possible to process 8 bytes in one step in the same way.
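For example, the 8-byte step could look like this (a sketch; the name follows the Reorder16bit2 pattern above, and the same casting caveats apply):

inline void Reorder16bit4(const uint8_t * src, uint8_t * dst)
{
    uint64_t value = *(uint64_t*)src;
    *(uint64_t*)dst = (value & 0xFF00FF00FF00FF00ULL) >> 8 |
                      (value & 0x00FF00FF00FF00FFULL) << 8;
}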
2) The ARMv7 platform supports SIMD instructions called NEON. Using them, you can make your code even faster than in 1):
#include <arm_neon.h>

inline void Reorder16bit(const uint8_t * src, uint8_t * dst)
{
    uint16_t value = *(uint16_t*)src;
    *(uint16_t*)dst = value >> 8 | value << 8;
}

inline void Reorder16bit8(const uint8_t * src, uint8_t * dst)
{
    uint8x16_t _src = vld1q_u8(src);
    vst1q_u8(dst, vrev16q_u8(_src));
}

void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
    assert(size % 2 == 0);
    size_t alignedSize = size/16*16;
    for (size_t i = 0; i < alignedSize; i += 16)
        Reorder16bit8(src + i, dst + i);
    for (size_t i = alignedSize; i < size; i += 2)
        Reorder16bit(src + i, dst + i);
}
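Note that the intrinsics need NEON enabled at compile time; something like this should work on the Pi 2 (assuming a hard-float GCC toolchain):

g++ -O3 -mfpu=neon -mfloat-abi=hard reorder.cpp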
Upvotes: 8
Reputation: 32727
You'd want to measure to see which is faster, but an alternative body for Reorder16bit would be
*(uint16_t*)dst = 256 * src[0] + src[1];
assuming that your native ints are little-endian. Another possibility:
dst[0] = src[1];
dst[1] = src[0];
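For completeness, the whole function built on the byte copy might look like this (a sketch); plain byte accesses sidestep the alignment and strict-aliasing questions that the pointer casts raise, and the temporary makes it safe for in-place use (src == dst):

#include <cassert>
#include <cstddef>
#include <cstdint>

void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
    assert(size % 2 == 0);
    for (size_t i = 0; i < size; i += 2)
    {
        uint8_t tmp = src[i]; // temporary allows src == dst
        dst[i] = src[i + 1];
        dst[i + 1] = tmp;
    }
}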
Upvotes: 2