Reputation: 345
I'm doing a project on an ARM Cortex-M0, which does not support unaligned memory access, and I'm trying to optimize the speed of operations on unaligned data.
I'm storing Bluetooth Low Energy device addresses (48-bit) as 6-byte arrays in some packed structs acting as packet buffers. Because of the packing, the BLE addresses do not necessarily start at a word-aligned address, and I'm running into some complications when optimizing my access functions for these addresses.
The first, and most obvious approach is a for loop operating on each byte in the array individually. Checking if two addresses are the same could for instance be done like this:
uint8_t ble_adv_addr_is_equal(uint8_t* addr1, uint8_t* addr2)
{
    for (uint32_t i = 0; i < 6; ++i)
    {
        if (addr1[i] != addr2[i])
            return 0;
    }
    return 1;
}
I'm doing a lot of comparisons in my project, and I wanted to see if I could squeeze some more speed out of this function. I realised that for aligned addresses, I could read them through a uint64_t pointer and compare with 48-bit masks applied, i.e.
(*(uint64_t*)&addr1[0] & 0xFFFFFFFFFFFFull) == (*(uint64_t*)&addr2[0] & 0xFFFFFFFFFFFFull)
Similar operations could be done for writing, and it works well for aligned versions. However, since my addresses aren't always word-aligned (or even half-word), I would have to do some extra tricks to make this work.
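For completeness, the aligned write can be sketched the same way: one 64-bit read-modify-write under a 48-bit mask. This is a hypothetical helper (the name and signature are mine, not from my codebase), and it assumes a little-endian target like the Cortex-M0, suitably aligned pointers, and that the two bytes after the destination address may be read and rewritten:

```c
#include <stdint.h>

/* Sketch: store a 48-bit address into dst with one 64-bit read-modify-write.
 * Assumes little-endian byte order, suitably aligned src/dst, and that the
 * two bytes following the 6-byte destination are readable and preserved. */
static void ble_adv_addr_write_aligned(uint8_t *dst, const uint8_t *src)
{
    uint64_t mask = 0xFFFFFFFFFFFFull;           /* low 48 bits = bytes 0..5 */
    uint64_t v = *(const uint64_t *)src & mask;  /* new address bytes       */
    uint64_t d = *(uint64_t *)dst & ~mask;       /* keep trailing bytes 6..7 */
    *(uint64_t *)dst = d | v;
}
```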
First off, I came up with this unoptimized nightmare of a compiler macro:
#define ADDR_ALIGNED(_addr) (uint64_t)                                                                                 \
    ((( *(uint64_t*)( (uint32_t)(_addr)      & ~0x03) >> (     8*((uint32_t)(_addr) & 0x03))) & 0x00000000FFFFFFFFull) \
   | (( *(uint64_t*)(((uint32_t)(_addr) + 4) & ~0x03) << (32 - 8*((uint32_t)(_addr) & 0x03))) & 0x0000FFFF00000000ull))
It basically shifts the entire address to start at the previous word aligned memory position, regardless of offset. For instance:
    0       1       2       3
|-------|-------|-------|-------|
|.......|.......|.......|<ADDR0>|
|<ADDR1>|<ADDR2>|<ADDR3>|<ADDR4>|
|<ADDR5>|.......|.......|.......|
becomes
    0       1       2       3
|-------|-------|-------|-------|
|<ADDR0>|<ADDR1>|<ADDR2>|<ADDR3>|
|<ADDR4>|<ADDR5>|.......|.......|
|.......|.......|.......|.......|
and I can safely do a 64-bit comparison of two addresses, regardless of their actual alignment:
ADDR_ALIGNED(addr1) == ADDR_ALIGNED(addr2)
Neat! But this operation takes 71 lines of assembly when compiled with ARM MDK, compared to 53 when doing the comparison in a simple for loop (I'm just going to disregard the additional time spent in the branch instructions here), and ~30 when the loop is unrolled. Also, it doesn't work for writes, as the alignment only happens in the registers, not in memory. Un-aligning the value again would require a similar operation, and the whole approach generally seems to suck.
Is an unrolled for-loop working each byte individually really the fastest solution for cases like this? Does anyone have any experience with similar scenarios, and feel like sharing some of their wizardry here?
Upvotes: 17
Views: 3549
Reputation: 12515
UPDATE
Ok, because your data has no alignment whatsoever, you need to either read all the data in, byte by byte, into properly aligned buffers and then do really fast 64-bit compares, or, if you won't be using the data after the compares, just read the data in as bytes and do 6 compares, in which case calling memcmp() might be a better option.
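The memcmp() variant is a one-liner; as a sketch, mirroring the question's function name:

```c
#include <stdint.h>
#include <string.h>

/* Byte-wise comparison delegated to memcmp(); no alignment assumptions.
 * For a fixed size of 6, a good compiler will often inline and unroll it. */
static uint8_t ble_adv_addr_is_equal(const uint8_t *addr1, const uint8_t *addr2)
{
    return memcmp(addr1, addr2, 6) == 0;
}
```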
For at least 16-bit aligned:
uint8_t ble_adv_addr_is_equal(const uint8_t* addr1, const uint8_t* addr2)
{
    const uint16_t *src1 = (const uint16_t *)addr1;
    const uint16_t *src2 = (const uint16_t *)addr2;

    for (int i = 0; i < 3; ++i)
    {
        if (src1[i] != src2[i])
            return 0;
    }
    return 1;
}
This will be roughly twice as fast as byte-wise comparison and might be the best you can reasonably do as long as your data is at least 2-byte aligned. I'd also expect the compiler to remove the for loop completely and replace it with three straight-line compares.
Trying 32-bit aligned reads will not be faster unless you can guarantee that addr1 and addr2 are similarly aligned, i.e. (addr1 & 0x03) == (addr2 & 0x03). If that is the case, then you can read in a 32-bit value and then a 16-bit value (or vice versa, depending on the starting alignment) and remove one more compare.
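As a sketch of that case (a hypothetical helper, assuming both pointers happen to be 4-byte aligned, which is the offset-0 instance of "similarly aligned"):

```c
#include <stdint.h>

/* Sketch: compare a 6-byte address as one 32-bit load plus one 16-bit load.
 * Assumes both addr1 and addr2 are 4-byte aligned. */
static uint8_t ble_adv_addr_is_equal_aligned(const uint8_t *addr1,
                                             const uint8_t *addr2)
{
    if (*(const uint32_t *)addr1 != *(const uint32_t *)addr2)
        return 0;                                   /* bytes 0..3 differ */
    return *(const uint16_t *)(addr1 + 4) == *(const uint16_t *)(addr2 + 4);
}
```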
As 16-bit is your shared base alignment, you can start there, and the compiler should generate nice ldrh-type opcodes.
Upvotes: 6
Reputation: 5118
While reading this SIMD vector-class documentation, I found out how to allocate variables, both statically and dynamically, with proper alignment: http://www.agner.org/optimize/vectorclass.pdf
Page 101
On Windows, write:
__declspec(align(16)) int mydata[1000];
In Unix-like systems, write:
int mydata[1000] __attribute__((aligned(16)));
Page 16
If you need an array of a size that is determined at runtime, then you will have a problem with alignment. Each vector needs to be stored at an address divisible by 16, 32 or 64 bytes, according to its size. The compiler can do this when defining a fixed-size array, as in the above example, but not necessarily with dynamic memory allocation. If you create an array of dynamic size by using new, malloc or an STL container, or any other method, then you may not get the proper alignment for vectors and the program will very likely crash when accessing a misaligned vector. The C++ standard says "It is implementation defined if new-expression, [...] support over-aligned types". Possible solutions are to use posix_memalign, _aligned_malloc, std::aligned_storage, std::align, etc. depending on what is supported by your compiler, but the method may not be portable to all platforms.
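In standard C11 the dynamic case can be handled with aligned_alloc (a sketch; posix_memalign or _aligned_malloc are the platform-specific alternatives the quote mentions, and the helper name here is mine):

```c
#include <stdlib.h>

/* Sketch: 16-byte-aligned dynamic allocation via C11 aligned_alloc().
 * Note: aligned_alloc requires the size to be a multiple of the alignment. */
int *alloc_aligned_ints(size_t n)
{
    size_t bytes = n * sizeof(int);
    bytes = (bytes + 15u) & ~(size_t)15u;   /* round up to a multiple of 16 */
    return aligned_alloc(16, bytes);
}
```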
Upvotes: -3
Reputation:
You might get your compiler to choose the fastest way for you:
#include <stdint.h>
#include <stddef.h>
#include <string.h>
uint64_t unalignedload(char const *packed)
{
    uint64_t buffer;
    memcpy(&buffer, packed, 8);
    return buffer;
}
This is not quite what you want, as loading 8 bytes might segfault if you're unaligned and run off the page, but it's a start. If you can add two bytes of padding to the end of the array, you can easily avoid that problem.
gcc and clang seem to optimize this well.
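Building on that, a sketch of the full 48-bit compare (assuming a little-endian target like the Cortex-M0, and at least two readable padding bytes after each 6-byte address, per the caveat above):

```c
#include <stdint.h>
#include <string.h>

/* memcpy-based load the compiler can lower to whatever is fastest. */
static uint64_t unalignedload(char const *packed)
{
    uint64_t buffer;
    memcpy(&buffer, packed, 8);
    return buffer;
}

/* Sketch: compare two 6-byte addresses by XOR-ing two 64-bit loads and
 * masking off the two trailing padding bytes (little-endian assumption). */
static int ble_adv_addr_is_equal(const char *addr1, const char *addr2)
{
    return ((unalignedload(addr1) ^ unalignedload(addr2))
            & 0xFFFFFFFFFFFFull) == 0;
}
```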
Upvotes: 4