devtk

Reputation: 2179

memset slow on 32-bit embedded platform

I am developing on an embedded device (STM32, ARM Cortex-M4) and expected memset and similar functions to be optimized for speed. However, I noticed much slower behavior than expected. I'm using the GNU ARM Embedded compiler/linker (arm-none-eabi-gcc, etc.) with the -O3 optimization flag.

I looked into the disassembly and the memset function is writing one byte at a time and rechecking bounds at each iteration.

0x802e2c4 <memset>: add r2, r0
0x802e2c6 <memset+2>:   mov r3, r0
0x802e2c8 <memset+4>:   cmp r3, r2
0x802e2ca <memset+6>:   bne.n   0x802e2ce <memset+10>
0x802e2cc <memset+8>:   bx  lr
0x802e2ce <memset+10>:  strb.w  r1, [r3], #1
0x802e2d2 <memset+14>:  b.n 0x802e2c8

Naturally, this code could be sped up by using 32-bit writes and/or loop unrolling at the expense of code size. It is possible the implementers chose not to optimize this for speed in order to keep code size down.
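To illustrate what "32-bit writes" means here, a minimal word-wise fill might look like the following. This is only an illustrative sketch, not the toolchain's actual memset, and fill32 is a hypothetical helper name; it assumes the destination is already 4-byte aligned, whereas a real implementation must also handle an unaligned head and tail with byte stores:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a word-wise fill. Assumes dst is 4-byte aligned. */
static void fill32(uint32_t *dst, uint8_t byte, size_t nwords) {
    /* replicate the fill byte into all four byte lanes of a word */
    uint32_t word = byte * 0x01010101u;
    while (nwords--) {
        *dst++ = word;  /* one 32-bit store covers 4 bytes */
    }
}
```

Each iteration writes 4 bytes with a single store, so for aligned buffers this already removes most of the per-byte loop overhead even before any unrolling.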

The memset header and library are being included from:

C:\Program Files (x86)\GNU Tools Arm Embedded\7 2018-q2-update\arm-none-eabi\include\string.h
C:\Program Files (x86)\GNU Tools Arm Embedded\7 2018-q2-update\arm-none-eabi\include\c++\7.3.1\cmath

This question is similar to existing questions but is different in that it targets an embedded platform.

Is there an optimized memset readily available within the GNU ARM embedded package? If so how can I access it?

Upvotes: 4

Views: 2528

Answers (2)

Erlkoenig

Reputation: 2754

Link without -specs=nano.specs. That spec file selects the size-optimized newlib-nano; without it the linker uses the full newlib C library, whose memset (among other functions) is optimized for speed instead of size. Note that this also pulls in larger versions of many other functions (usual suspects: printf and malloc), which can in turn be tuned with additional linker options. Examining the disassembly and the linker map file will help.
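One way to see the difference is to link the same object file both ways and compare the memset that each build pulls in. This is only a sketch with hypothetical file names; the flags are the usual ones for a GNU Arm Embedded toolchain:

```shell
# Link against newlib-nano (size-optimized, byte-at-a-time memset):
arm-none-eabi-gcc -O3 -mcpu=cortex-m4 -specs=nano.specs \
    -Wl,-Map=nano.map main.o -o nano.elf

# Link against full newlib (speed-optimized memset):
arm-none-eabi-gcc -O3 -mcpu=cortex-m4 \
    -Wl,-Map=full.map main.o -o full.elf

# Compare the code pulled in for memset in each image:
arm-none-eabi-objdump -d nano.elf | sed -n '/<memset>:/,/^$/p'
arm-none-eabi-objdump -d full.elf | sed -n '/<memset>:/,/^$/p'
```

The map files (nano.map, full.map) also show which library archive each symbol was resolved from, which makes the size trade-off visible per function.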

Upvotes: 1

devtk

Reputation: 2179

I'm not sure whether GNU Tools ARM Embedded ships an optimized memset, or how to select one via linker options, but it can be optimized by hand in assembly. After defining the function below, the linker used this version without complaining about a redefined symbol, which seems odd to me. The overall speedup is about 9x (i.e. this version takes about 11% as long as the original byte-wise method).

// optimized version of memset
// requires <stddef.h>, <stdint.h> and <assert.h> (for static_assert, C11)
// we split up the region into several segments:
//
// base_ptr
// * store single bytes
// mid1
// * store words, 4 at a time
// mid2
// * store words, 1 at a time
// mid3
// * store single bytes
// end
//
// For large buffers, most of the time is spent between mid1 and mid2, which is
// highly optimized.
void * memset(void * base_ptr, int x, size_t length) {
  const uint32_t int_size = sizeof(uint32_t);
  static_assert(sizeof(uint32_t) == 4, "only supports 32 bit size");
  // treat the pointer as an integer so we can do alignment arithmetic
  uint32_t ptr = (uint32_t) base_ptr;
  // one past the last byte to set
  uint32_t end = ptr + length;
  // first word-aligned address at/after the start, but not after the end
  uint32_t mid1 = (ptr + int_size - 1) / int_size * int_size;
  if (mid1 > end) {
    mid1 = end;
  }
  // last word-aligned address at/before the end
  uint32_t mid3 = end / int_size * int_size;
  // for short unaligned buffers mid3 can round down below mid1; clamp it so
  // the unsigned subtraction below cannot wrap around
  if (mid3 < mid1) {
    mid3 = mid1;
  }
  // end of the 4-words-per-iteration section
  uint32_t mid2 = mid1 + (mid3 - mid1) / (4 * int_size) * (4 * int_size);
  // replicate the fill byte into all four bytes of a word
  uint32_t value = 0;
  for (uint16_t i = 0; i < int_size; ++i) {
    value <<= 8;
    value |= (uint8_t) x;
  }
  __asm volatile (
  // store bytes
  "b Compare1%=\n"
  "Store1%=:\n"
  "strb %[value], [%[ptr]], #1\n"
  "Compare1%=:\n"
  "cmp %[ptr], %[mid1]\n"
  "bcc Store1%=\n"
  // store words, 4 per iteration
  "b Compare2%=\n"
  "Store2%=:\n"
  "str %[value], [%[ptr]], #4\n"
  "str %[value], [%[ptr]], #4\n"
  "str %[value], [%[ptr]], #4\n"
  "str %[value], [%[ptr]], #4\n"
  "Compare2%=:\n"
  "cmp %[ptr], %[mid2]\n"
  "bcc Store2%=\n"
  // store words
  "b Compare3%=\n"
  "Store3%=:\n"
  "str %[value], [%[ptr]], #4\n"
  "Compare3%=:\n"
  "cmp %[ptr], %[mid3]\n"
  "bcc Store3%=\n"
  // store bytes
  "b Compare4%=\n"
  "Store4%=:\n"
  "strb %[value], [%[ptr]], #1\n"
  "Compare4%=:\n"
  "cmp %[ptr], %[end]\n"
  "bcc Store4%=\n"
  // the asm advances ptr, so it must be an in/out operand, not an input
  : [ptr] "+r"(ptr)
  : [value] "r"(value),
  [mid1] "r"(mid1),
  [mid2] "r"(mid2),
  [mid3] "r"(mid3),
  [end] "r"(end)
  // cmp clobbers the flags; the stores clobber memory
  : "cc", "memory"
  );
  return base_ptr;
}

Speed differences when operating on 32kB of data:

  • original memset: 197045 ticks (~6 per byte)
  • optimized memset: 22582 ticks (~0.7 per byte)
  • max theoretical speed: 16384 ticks

The max speed is 2 ticks (speed of str instruction) per 4 bytes.

The original memset takes 16 bytes of code. The new one takes 98 bytes.

Upvotes: 4
