Reputation: 519
ARM CPUs, at least up to ARMv5, do not allow random access to memory addresses which are not word aligned. The problem is described at length here: http://lecs.cs.ucla.edu/wiki/index.php/XScale_alignment. One suggested solution is to rewrite your code or to take this alignment into account in the first place. However, it does not say how. Suppose I have a byte stream containing 2- or 4-byte integers which are not word aligned in the stream. How do I access this data in a smart way without losing too much performance?
I have a code snippet which illustrates the problem:
#include <stdio.h>
#include <stdlib.h>
#define BUF_LEN 17
int main( int argc, char *argv[] ) {
    unsigned char buf[BUF_LEN];
    int i;
    unsigned short *p_short;
    unsigned long *p_long;
    /* fill array */
    (void) printf( "filling buffer:" );
    for ( i = 0; i < BUF_LEN; i++ ) {
        /* buf[i] = 1 << ( i % 8 ); */
        buf[i] = i;
        (void) printf( " %02hhX", buf[i] );
    }
    (void) printf( "\n" );
    /* testing with short */
    (void) printf( "accessing with short:" );
    for ( i = 0; i < BUF_LEN - sizeof(unsigned short); i++ ) {
        p_short = (unsigned short *) &buf[i];
        (void) printf( " %04hX", *p_short );
    }
    (void) printf( "\n" );
    /* testing with long */
    (void) printf( "accessing with long:" );
    for ( i = 0; i < BUF_LEN - sizeof(unsigned long); i++ ) {
        p_long = (unsigned long *) &buf[i];
        (void) printf( " %08lX", *p_long );
    }
    (void) printf( "\n" );
    return EXIT_SUCCESS;
}
On an x86 CPU this is the output:
filling buffer: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10
accessing with short: 0100 0201 0302 0403 0504 0605 0706 0807 0908 0A09 0B0A 0C0B 0D0C 0E0D 0F0E
accessing with long: 03020100 04030201 05040302 06050403 07060504 08070605 09080706 0A090807 0B0A0908 0C0B0A09 0D0C0B0A 0E0D0C0B 0F0E0D0C
On an ATMEL AT91SAM9G20 ARMv5 core I get (note: this is the expected behaviour of this CPU!):
filling buffer: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10
accessing with short: 0100 0100 0302 0302 0504 0504 0706 0706 0908 0908 0B0A 0B0A 0D0C 0D0C 0F0E
accessing with long: 03020100 00030201 01000302 02010003 07060504 04070605 05040706 06050407 0B0A0908 080B0A09 09080B0A 0A09080B 0F0E0D0C
So, given that I want or have to access the byte stream at unaligned addresses: how would I do that efficiently on ARM?
Upvotes: 1
Views: 1185
Reputation: 41200
This function always uses aligned 32-bit accesses (the shifts assume a little-endian target, and it may read up to 3 bytes before and after the requested value):
#include <stdint.h>
uint32_t fetch_unaligned_uint32 (const uint8_t *unaligned_stream)
{
    uintptr_t addr = (uintptr_t)unaligned_stream;
    /* w[0] is the aligned word holding the first byte, w[1] the word after it */
    const uint32_t *w = (const uint32_t *)(addr & ~(uintptr_t)3u);
    switch (addr & 3u)
    {
    case 3u:
        return (w[0] >> 24) | (w[1] << 8);
    case 2u:
        return (w[0] >> 16) | (w[1] << 16);
    case 1u:
        return (w[0] >> 8) | (w[1] << 24);
    case 0u:
    default:
        return w[0];
    }
}
It may be faster than reading and shifting all 4 bytes separately.
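For comparison, the byte-wise alternative mentioned above could look roughly like this. It is only a sketch: it assumes the stream stores integers in little-endian order (as the outputs in the question suggest), and the names `load_le16` and `load_le32` are made up for illustration. Because it performs only byte loads, it works at any address.

```c
#include <stdint.h>

/* Hypothetical helpers: assemble a little-endian integer from single bytes.
 * Assumption: the stream is little-endian. */
static uint16_t load_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Measuring both variants on the actual target is the only reliable way to tell which one is faster there.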
Upvotes: 0
Reputation: 71556
Your example will demonstrate problems on any platform. The simple fix, of course:
unsigned char *buf;
int i;
unsigned short *p_short;
unsigned long p_long[BUF_LEN>>2];
If you cannot organize the data with better alignment (more bytes can at times equal better performance), then do the obvious: address everything as 32-bit words and chop out portions from there. The optimizer will take care of a lot of it for the shorts and bytes within a word. (In fact, including bytes and shorts in your structures, whether they are real structures or bytes picked out of memory, can be more costly than passing everything around as words, because extra instructions are needed. You have to do your system engineering.)
An example to extract an unaligned word (you have to manage your endianness, of course):
a = (lptr[offset]<<16)|(lptr[offset+1]>>16);
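To make the endianness remark concrete: the one-liner above fits a big-endian layout with the value starting on a halfword boundary. A sketch for a little-endian layout and an arbitrary byte offset might look like the following. `extract_u32_le`, `words`, and `byte_offset` are hypothetical names, and the caller is assumed to guarantee that the following word exists whenever the offset is not a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: words[] holds the stream as aligned 32-bit values as a
 * little-endian CPU would load them. */
static uint32_t extract_u32_le(const uint32_t *words, size_t byte_offset)
{
    size_t   index = byte_offset / 4;
    unsigned shift = (unsigned)(byte_offset % 4) * 8;

    if (shift == 0)
        return words[index];   /* already aligned; also avoids a shift by 32 */

    /* low part comes from the current word, high part from the next one */
    return (words[index] >> shift) | (words[index + 1] << (32 - shift));
}
```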
All ARM cores from ARMv4 to the present allow unaligned access; most have the exception turned on by default, but you can turn it off. The older ones rotate within the word, but others can grab other byte lanes, if I am not mistaken.
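The rotation on older cores explains the 32-bit ARM output in the question. A small software model of it, as a sketch only: `simulated_armv5_ldr` is a hypothetical name, memory is treated as little-endian, index 0 of `mem` is assumed to be word aligned, and only the word case is modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of an unaligned 32-bit load on a core that rotates instead of
 * crossing the word boundary: fetch the enclosing aligned word, then
 * rotate it right by 8 bits per byte of misalignment. */
static uint32_t simulated_armv5_ldr(const uint8_t *mem, size_t addr)
{
    size_t   base = addr & ~(size_t)3;
    unsigned rot  = (unsigned)(addr & 3) * 8;
    uint32_t word = (uint32_t)mem[base]
                  | ((uint32_t)mem[base + 1] << 8)
                  | ((uint32_t)mem[base + 2] << 16)
                  | ((uint32_t)mem[base + 3] << 24);

    if (rot == 0)
        return word;           /* aligned access, nothing to rotate */
    return (word >> rot) | (word << (32 - rot));
}
```

With `mem[i] = i`, addresses 1, 2 and 3 give 00030201, 01000302 and 02010003, which are the values in the question's "accessing with long" line for the AT91SAM9G20.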
Do your system engineering, do your performance analysis and determine if moving everything as words is faster or slower. The actual moving of data will have some overhead, but code on both sides will run much faster if everything is aligned. Can you suffer some number X times slower data move to have a 2x to 4x improvement on generation and reception of that data?
Upvotes: 1
Reputation: 11896
You write your own packing/unpacking functions, which translate between aligned variables and the unaligned byte stream. For example,
void unpack_uint32(uint8_t* unaligned_stream, uint32_t* aligned_var)
{
// copy byte-by-byte from stream to var, you can fill in the details
}
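One possible way to fill in those details, offered as a sketch rather than the answerer's own implementation, is to let `memcpy` do the byte copy. `pack_uint32` is a hypothetical counterpart added for symmetry. The value keeps the host's byte order, so this suits data produced on the same kind of machine; a stream with a fixed byte order would need explicit shifts instead.

```c
#include <stdint.h>
#include <string.h>

/* Stream -> aligned variable. memcpy makes no alignment assumptions
 * about the source, so the compiler has to emit code that is valid
 * for an unaligned stream pointer. */
void unpack_uint32(const uint8_t *unaligned_stream, uint32_t *aligned_var)
{
    memcpy(aligned_var, unaligned_stream, sizeof *aligned_var);
}

/* Aligned variable -> stream (hypothetical counterpart). */
void pack_uint32(uint32_t value, uint8_t *unaligned_stream)
{
    memcpy(unaligned_stream, &value, sizeof value);
}
```

A round trip such as `pack_uint32(x, buf + 1); unpack_uint32(buf + 1, &y);` leaves `y` equal to `x` regardless of the offset.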
Upvotes: 2