Reputation: 2179
I am developing on an embedded device (STM32, ARM Cortex-M4) and expected memset and similar functions to be optimized for speed. However, I noticed much slower behavior than expected. I'm using the GNU Arm Embedded compiler/linker (arm-none-eabi-gcc, etc.) with the -O3 optimization flag.
I looked into the disassembly, and the memset function writes one byte at a time and rechecks the bounds on every iteration.
0x802e2c4 <memset>: add r2, r0
0x802e2c6 <memset+2>: mov r3, r0
0x802e2c8 <memset+4>: cmp r3, r2
0x802e2ca <memset+6>: bne.n 0x802e2ce <memset+10>
0x802e2cc <memset+8>: bx lr
0x802e2ce <memset+10>: strb.w r1, [r3], #1
0x802e2d2 <memset+14>: b.n 0x802e2c8
Naturally, this code could be sped up by using 32-bit writes and/or loop unrolling at the expense of code size. It is possible the implementers chose not to optimize this for speed in order to keep code size down.
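To make the 32-bit-write idea concrete, here is a minimal C sketch of a word-wise fill. The name fill_words and the word_alias_t typedef are hypothetical and not part of newlib; the may_alias attribute is a GCC/Clang extension, assumed here so that word stores into a byte buffer stay within the aliasing rules. It shows the technique only and is not tuned for the Cortex-M4.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension (assumption): a 32-bit type that may alias any object,
 * so storing through it into a byte buffer does not break strict aliasing. */
typedef uint32_t __attribute__((may_alias)) word_alias_t;

/* Hypothetical helper, not part of newlib: fill `length` bytes at `dst`
 * with the low byte of `x`, using aligned 32-bit stores for the bulk. */
void *fill_words(void *dst, int x, size_t length)
{
    unsigned char *p = (unsigned char *)dst;
    unsigned char byte = (unsigned char)x;

    /* Head: single bytes until p is 4-byte aligned or the region ends. */
    while (length > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = byte;
        --length;
    }

    /* Body: replicate the byte into all four lanes, then store whole words. */
    uint32_t word = (uint32_t)byte * 0x01010101u;
    word_alias_t *w = (word_alias_t *)p;
    while (length >= 4) {
        *w++ = word;
        length -= 4;
    }

    /* Tail: the remaining 0..3 bytes. */
    p = (unsigned char *)w;
    while (length > 0) {
        *p++ = byte;
        --length;
    }
    return dst;
}
```

Unrolling the body loop (for example four stores per iteration) would trade more code size for fewer compare-and-branch steps.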
The memset header and library are being included from:
C:\Program Files (x86)\GNU Tools Arm Embedded\7 2018-q2-update\arm-none-eabi\include\string.h
C:\Program Files (x86)\GNU Tools Arm Embedded\7 2018-q2-update\arm-none-eabi\include\c++\7.3.1\cmath
This question is similar to existing questions but is different in that it targets an embedded platform.
Is there an optimized memset readily available within the GNU ARM embedded package? If so how can I access it?
Upvotes: 4
Views: 2528
Reputation: 2754
Link without -specs=nano.specs. Without that option the linker uses the full newlib build of the C library (including memset), which is optimized for speed, instead of newlib-nano, which is optimized for size. This also pulls in larger versions of many other functions (usual suspects: printf and malloc), which could in turn be trimmed with additional linker options. Examining the disassembly and the linker map file will help.
Upvotes: 1
Reputation: 2179
I'm not sure whether GNU Tools ARM Embedded has an optimized memset, or how to access it via linker options, but memset can be optimized manually in assembly. After I defined the version below, the linker used it without complaining about a redefined function, which seems odd to me. The overall speed increase is about 9x (i.e. this version takes about 11% as long as the original byte-wise method).
// optimized version of memset
// we split up the region into several segments
//
// base_ptr
// * store single bytes
// mid1
// * store words, 4 at a time
// mid2
// * store words, 1 at a time
// mid3
// * store single bytes
// end
//
// For large buffers, most of the time is spent between mid1 and mid2 which is
// highly optimized.
void * memset(void * base_ptr, int x, size_t length) {
    const uint32_t int_size = sizeof(uint32_t);
    static_assert(sizeof(uint32_t) == 4, "only supports 32 bit size");
    // start of the region as an integer address
    uint32_t ptr = (uint32_t) base_ptr;
    // get end of memory to set
    uint32_t end = ptr + length;
    // get location of first word-aligned address at/after the start, but not
    // after the end
    uint32_t mid1 = (ptr + int_size - 1) / int_size * int_size;
    if (mid1 > end) {
        mid1 = end;
    }
    // get location of last word-aligned address at/before the end, but not
    // before mid1 (otherwise mid3 - mid1 underflows when a short region lies
    // inside a single word, and the word loops would run far past the end)
    uint32_t mid3 = end / int_size * int_size;
    if (mid3 < mid1) {
        mid3 = mid1;
    }
    // get end location of optimized section
    uint32_t mid2 = mid1 + (mid3 - mid1) / (4 * int_size) * (4 * int_size);
    // create a word-sized integer with the byte replicated in every lane
    uint32_t value = 0;
    for (uint16_t i = 0; i < int_size; ++i) {
        value <<= 8;
        value |= (uint8_t) x;
    }
    __ASM volatile (
        // store bytes
        "b Compare1%=\n"
        "Store1%=:\n"
        "strb %[value], [%[ptr]], #1\n"
        "Compare1%=:\n"
        "cmp %[ptr], %[mid1]\n"
        "bcc Store1%=\n"
        // store words optimized
        "b Compare2%=\n"
        "Store2%=:\n"
        "str %[value], [%[ptr]], #4\n"
        "str %[value], [%[ptr]], #4\n"
        "str %[value], [%[ptr]], #4\n"
        "str %[value], [%[ptr]], #4\n"
        "Compare2%=:\n"
        "cmp %[ptr], %[mid2]\n"
        "bcc Store2%=\n"
        // store words
        "b Compare3%=\n"
        "Store3%=:\n"
        "str %[value], [%[ptr]], #4\n"
        "Compare3%=:\n"
        "cmp %[ptr], %[mid3]\n"
        "bcc Store3%=\n"
        // store bytes
        "b Compare4%=\n"
        "Store4%=:\n"
        "strb %[value], [%[ptr]], #1\n"
        "Compare4%=:\n"
        "cmp %[ptr], %[end]\n"
        "bcc Store4%=\n"
        : // ptr is advanced by the post-indexed stores, so it is read-write
          [ptr] "+r"(ptr)
        : [value] "r"(value),
          [mid1] "r"(mid1),
          [mid2] "r"(mid2),
          [mid3] "r"(mid3),
          [end] "r"(end)
        : // cmp changes the flags; the stores write memory the compiler cannot see
          "cc", "memory"
    );
    return base_ptr;
}
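Because a hand-written memset has several boundary cases (unaligned start, unaligned end, regions shorter than one word), a small check over all alignments and short lengths is worth running before relying on it. The following is a sketch only: count_memset_failures and memset_fn are hypothetical names, and the guard bytes only detect small overruns; a gross overrun would more likely fault than be counted.

```c
#include <stddef.h>
#include <string.h>  /* only for passing memset itself to the checker */

/* Hypothetical function-pointer type matching the memset signature. */
typedef void *(*memset_fn)(void *, int, size_t);

/* Hypothetical helper: returns the number of (offset, length) cases in which
 * `fn` returned the wrong pointer, left a byte unset inside the requested
 * range, or modified a byte outside it. */
unsigned count_memset_failures(memset_fn fn)
{
    enum { GUARD = 16, MAX_OFF = 8, MAX_LEN = 48,
           BUF = GUARD + MAX_OFF + MAX_LEN + GUARD };
    unsigned char buf[BUF];
    unsigned failures = 0;

    for (size_t off = 0; off < MAX_OFF; ++off) {       /* every alignment */
        for (size_t len = 0; len <= MAX_LEN; ++len) {  /* short and mid sizes */
            unsigned char *dst = buf + GUARD + off;
            size_t first = GUARD + off;                /* index of dst in buf */
            int ok;

            for (size_t i = 0; i < BUF; ++i) {
                buf[i] = 0x55;                         /* background pattern */
            }
            ok = (fn(dst, 0xA5, len) == dst);
            for (size_t i = 0; i < BUF; ++i) {
                int inside = (i >= first) && (i < first + len);
                if (buf[i] != (inside ? 0xA5 : 0x55)) {
                    ok = 0;
                }
            }
            if (!ok) {
                ++failures;
            }
        }
    }
    return failures;
}
```

On the target, calling count_memset_failures(memset) exercises whichever memset the linker resolved, so it checks the replacement when one is defined.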
Speed differences when operating on 32kB of data:
The maximum speed is 2 ticks (the cost of one str instruction) per 4 bytes.
The original memset takes 16 bytes of code. The new one takes 98 bytes.
Upvotes: 4