STM32F407 Assembly, STR not writing into memory

Question

I am trying to program my STM32F407 in pure assembly (for learning purposes). I have narrowed my debugging down to a curious issue: the STR instruction is simply not writing the contents of a register into memory. The code is below. Essentially it is supposed to write 0x4 into a memory location (which sets up the GPIO port), but it doesn't.

Here's the code:

    /*SCiFI statements*/
    .syntax unified
    .cpu cortex-m4
    /*.fpu fpv4-sp-d16*/
    .thumb

    /*Vector Table Symbols*/
    .global vtable
    .global reset_handler

    /*vtable setup*/
    .type vtable, %object
vtable:
    .word _estack
    .word reset_handler
    .size vtable, . - vtable

    /*reset handler setup*/
    .type reset_handler, %function
reset_handler:
    /*Stack Pointer reset*/
    LDR r0, =_estack
    MOV sp, r0

    /*GPIOC is at 0x4002 0800 - 0x4002 0BFF*/
    LDR r2, =0x40020800 /*Address register*/
    /*Set up GPIOC as output*/
    /*Set to GPIO output mode*/
    LDR r1, =0x04
    STR r1, [r2] /*******HERE'S THE ISSUE LINE*******/

dummy_loop:
    B dummy_loop
    .size reset_handler, . - reset_handler

Linker Script for completeness

_estack = 0x2001c000; /*End of RAM*/

MEMORY{
    FLASH ( rx  ) : ORIGIN = 0x08000000, LENGTH = 1M
    SRAM  ( rxw ) : ORIGIN = 0x20000000, LENGTH = 112K
    GPIO  ( rw  ) : ORIGIN = 0x40020000, LENGTH = 36020400
}

I am compiling it with

arm-none-eabi-gcc -x assembler-with-cpp -c -O0 -mcpu=cortex-m4 -mthumb -Wall core.s -o core.o -g
arm-none-eabi-gcc core.o -mcpu=cortex-m4 -mthumb -Wall --specs=nosys.specs -nostdlib -lgcc -T./stm32f407.ld -o pintoggle.bin -g

where the .s and .ld files are the ones I pasted above.

Relevant gdb output for doubters

(gdb) break 30
Breakpoint 1 at 0x8000012: file core.s, line 30.
Note: automatically using hardware breakpoints for read-only addresses.
(gdb) continue
Continuing.

Breakpoint 1, reset_handler () at core.s:30
30              STR r1, [r2]
(gdb) x 0x40020800
0x40020800:     0x00000000
(gdb) s
33              B dummy_loop
(gdb) x 0x40020800
0x40020800:     0x00000000
(gdb) info registers
r0             0x2001c000          536985600
r1             0x4                 4
r2             0x40020800          1073874944
r3             0x0                 0
r4             0x0                 0
r5             0x0                 0
r6             0x0                 0
r7             0x0                 0
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x0                 0
r12            0x0                 0
sp             0x2001c000          0x2001c000
lr             0xffffffff          -1
pc             0x8000014           0x8000014 
xpsr           0x41000000          1090519040
msp            0x2001c000          0x2001c000
psp            0x0                 0x0
control        0x0                 0 '\000'
faultmask      0x0                 0 '\000'
basepri        0x0                 0 '\000'
primask        0x0                 0 '\000'
fpscr          0x0                 0

I'm not sure what I'm doing wrong. The program isn't particularly complex. Manipulating CPU-Registers works just fine, but writing into memory doesn't work at all.

The chip/board are fine. I threw some old C code I had lying around on it through the Keil environment and it worked just nicely. My goal is not to use any C at all, and just stick to assembly and gcc. This is what I want to learn.

EDIT: Since someone is bound to run into this question and ask even more. https://www.efton.sk/STM32/gotcha/index.html has a bunch of STM32 gotchas listed, where this question is #1 interestingly.

old_timer · Accepted Answer

You need to enable the GPIOC clock in the RCC first.

Read-modify-write RCC_AHB1ENR (0x40023830) and set bit 2 (GPIOCEN, mask 0x4):

ldr r0, =0x40023830
ldr r1, [r0]
orr r1, #4
str r1, [r0]

Then you can:

LDR r2, =0x40020800
LDR r1, =0x04
STR r1, [r2]
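The same two steps can also be written with named constants, which makes the bit positions easier to check against the reference manual. This is only a sketch: the symbol names and the `gpioc_pc1_output` label are made up for illustration, and it read-modify-writes MODER so the other pins keep their mode instead of storing a bare 0x4.

```asm
    .syntax unified
    .cpu cortex-m4
    .thumb

    /* Names are illustrative; the values are the ones used in this answer. */
    .equ RCC_AHB1ENR, 0x40023830
    .equ GPIOCEN,     (1 << 2)      /* bit 2, mask 0x4 */
    .equ GPIOC_MODER, 0x40020800
    .equ MODER1_MASK, (3 << 2)      /* the two mode bits of pin 1 */
    .equ MODER1_OUT,  (1 << 2)      /* 01 = general purpose output */

    .text
    .global gpioc_pc1_output        /* hypothetical routine name */
    .type gpioc_pc1_output, %function
gpioc_pc1_output:
    /* enable the GPIOC clock, preserving the other enable bits */
    ldr r0, =RCC_AHB1ENR
    ldr r1, [r0]
    orr r1, r1, #GPIOCEN
    str r1, [r0]

    /* set PC1 to output, preserving the other pins' modes */
    ldr r0, =GPIOC_MODER
    ldr r1, [r0]
    bic r1, r1, #MODER1_MASK
    orr r1, r1, #MODER1_OUT
    str r1, [r0]
    bx  lr
    .size gpioc_pc1_output, . - gpioc_pc1_output
```

Because MODER is read before it is written, there is also a load between the clock enable store and the MODER store.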

Note/FYI this is likely NOT to work due to a race condition:

ldr r0, =0x40023830
ldr r1, [r0]
orr r1, #4
ldr r2, =0x40020800
ldr r3, =0x4
str r1, [r0]
str r3, [r2]

Some number of clocks are needed between the write that sets the enable bit and the clock actually reaching the GPIO block. This should work, because a read of the GPIO sits between the two stores:

ldr r0, =0x40023830
ldr r1, [r0]
orr r1, #4
ldr r2, =0x40020800
ldr r3, =0x4
str r1, [r0]
ldr r4, [r2]
str r3, [r2]

Or just do things sequentially as above (prep the write with a few instructions).

ldr r0, =0x40023830
ldr r1, [r0]
orr r1, #4
str r1, [r0]

LDR r2, =0x40020800
LDR r1, =0x04
STR r1, [r2]

Just use binutils; there is no reason to mess with gcc at all.

flash.s

.thumb
.syntax unified

.global _start
_start:
.word 0x20001000
.word reset
.word loop
.word loop

.thumb_func
loop:   b .

.type reset, %function
reset:
    ldr r0, =0x40023830
    ldr r1, [r0]
    orr r1, #4
    str r1, [r0]

    LDR r2, =0x40020800
    LDR r1, =0x04
    STR r1, [r2]

    ldr r0, =0x40020818
    ldr r1, =0x00000002
    ldr r2, =0x00020000
loop_top:
    str r1,[r0]
    bl delay
    str r2,[r0]
    bl delay
    b loop_top

.thumb_func
delay:
    ldr r3,=0x200000
delay_loop:
    subs r3,#1
    bne delay_loop
    bx lr

flash.ld

MEMORY
{
    rom : ORIGIN = 0x08000000, LENGTH = 0x1000
    ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
    .text   : { *(.text*)   } > rom
    .bss    : { *(.bss*)    } > ram
}

build:

arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m4 flash.s -o flash.o
arm-none-eabi-ld -nostdlib -nostartfiles -T flash.ld flash.o -o flash.elf
arm-none-eabi-objdump -D flash.elf > flash.list
arm-none-eabi-objcopy -O binary flash.elf flash.bin

check:

Disassembly of section .text:

08000000 <_start>:
 8000000:   20001000    andcs   r1, r0, r0
 8000004:   08000013    stmdaeq r0, {r0, r1, r4}
 8000008:   08000011    stmdaeq r0, {r0, r4}
 800000c:   08000011    stmdaeq r0, {r0, r4}

08000010 <loop>:
 8000010:   e7fe        b.n 8000010 <loop>

08000012 <reset>:
 8000012:   480d        ldr r0, [pc, #52]   ; (8000048 <delay_loop+0x8>)
 8000014:   6801        ldr r1, [r0, #0]
 8000016:   f041 0104   orr.w   r1, r1, #4
 800001a:   6001        str r1, [r0, #0]
 800001c:   4a0b        ldr r2, [pc, #44]   ; (800004c <delay_loop+0xc>)
 800001e:   f04f 0104   mov.w   r1, #4
 8000022:   6011        str r1, [r2, #0]
 8000024:   480a        ldr r0, [pc, #40]   ; (8000050 <delay_loop+0x10>)
 8000026:   f04f 0102   mov.w   r1, #2
 800002a:   f44f 3200   mov.w   r2, #131072 ; 0x20000

0800002e <loop_top>:
 800002e:   6001        str r1, [r0, #0]
 8000030:   f000 f804   bl  800003c <delay>
 8000034:   6002        str r2, [r0, #0]
 8000036:   f000 f801   bl  800003c <delay>
 800003a:   e7f8        b.n 800002e <loop_top>

0800003c <delay>:
 800003c:   f44f 1300   mov.w   r3, #2097152    ; 0x200000

08000040 <delay_loop>:
 8000040:   3b01        subs    r3, #1
 8000042:   d1fd        bne.n   8000040 <delay_loop>
 8000044:   4770        bx  lr
 8000046:   38300000    ldmdacc r0!, {} ; <UNPREDICTABLE>
 800004a:   08004002    stmdaeq r0, {r1, lr}
 800004e:   08184002    ldmdaeq r8, {r1, lr}
 8000052:   Address 0x0000000008000052 is out of bounds.

The vectors look good (odd addresses, so the thumb bit is set), so it will not hang immediately.

I hate the unified syntax with a passion, certainly with binutils, but writing this:

delay_loop:
    sub r3,#1
    bne delay_loop

without the unified syntax under the GNU assembler actually produces subs, not sub, and that confuses people (and they comment). So I used unified syntax above. If you are starting out you should probably just enable unified syntax and learn it that way, sigh. (For where you are now, just enable it with that one line up front somewhere and keep doing what you are doing.)

I have no use for gdb, though perhaps this works there too. If you telnet into openocd you can certainly do it. Either defeat your code by having reset go straight into an infinite loop or, depending on the debug tool, reset the chip without starting the code. Then you can write these control registers over the debug interface and see them work before you write the code that does the same thing. This can save some time (or take longer, depending on your coding/debugging style).

mdw 0x40023830
mww 0x40023830 0x00000004 (OR this into the value returned by the prior read, so any enable bits already set are preserved)
mdw 0x40020800
mww 0x40020800 0x4
mdw 0x40020800

and you should see the 0x4 there.

Then mess with 0x40020818 to change the state of the output pin.

DMA is not relevant here. You have the processor, then its main AHB (or similar) bus, which goes through a memory controller and probably other buses until it reaches the logic that handles that control register. Then the enable for that clock goes through some number of gates to get to the clock gate for that logic block (GPIOC in this case). It might take at least one clock for the enable to be latched, and then probably at least another peripheral clock cycle for the clock gate to let the clock through to the GPIO. A particular contributor here would normally make this comment when seeing code like this. Some, but not all, of the STM32 documents tell you specifically how many us or ms you have to delay.

Writes can be fire and forget. The address and data are both part of the transaction, so the first-level memory controller closest to the processor can technically take those two items and tell the processor the write is complete (even though it is not), allowing it to do the next thing (another STR, for example). Something like the clock control logic and a peripheral like a GPIO may or may not be down the same set of buses, but eventually they split off.

Just like sending two letters from your house to two addresses in the same town, you write the addresses on the envelope, put them in the mailbox, as far as you are concerned they are sent, you can go back in the house and do something else. These two might ride in the same trucks and planes all the way to the same post office in that destination town, but eventually will get split up and take different paths which can take a different amount of time.

This is called a race condition and they are very real and they happen more than we would like, but this is not something to start to panic about or worry about every time something does not work. In general the vendor will indicate in the docs or an errata that there is a race condition. Race as in track and field in the Olympics, a marathon, NASCAR, two or more things trying to get to the finish line first or in this case in order.

This is why simply removing a printf from some code can have devastating results: that printf caused a big delay between the things before and after it. Removing the delay can cause a race.

It is not uncommon to have this specific situation with a control block, such as a clock enable or something that controls address decoding. Say a region of the processor's address space can, by design, be pointed at different things; if you change what it points at and then immediately try to talk to the thing you just pointed it at, you might have a race condition.

Writes are fire and forget, but reads have to go all the way to the peripheral and back, so the worst case time/path. Now I have worked on logic where the read path and write path split and you can have a race there too, but more of an exception not the rule. So if you for example write some control register, then read it back, one would hope the designers serialize this and the read happens after the write, it does a complete trip to the peripheral and back, so by the time it comes back and lets the processor continue, that register is definitely written.
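Applied to the clock enable, that idea looks like the sketch below: store the enable, then read the same register back before touching the GPIO. Whether the read-back is a long enough delay on a given part is an assumption to confirm against that part's documentation or errata, and the `enable_gpioc_clock` label is hypothetical.

```asm
    .syntax unified
    .cpu cortex-m4
    .thumb

    .text
    .global enable_gpioc_clock      /* hypothetical routine name */
    .type enable_gpioc_clock, %function
enable_gpioc_clock:
    ldr r0, =0x40023830     /* RCC_AHB1ENR */
    ldr r1, [r0]
    orr r1, r1, #4          /* GPIOCEN, bit 2 */
    str r1, [r0]            /* the core may move on before this write lands */
    ldr r1, [r0]            /* read back: has to make the trip to the RCC and back */
    bx  lr
    .size enable_gpioc_clock, . - enable_gpioc_clock
```

After the routine returns, r1 holds the value read back, so the caller can also check that bit 2 is set.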

In this case you have the possible delay for the write to the RCC, plus possibly different buses to reach the RCC versus the GPIO, causing timing differences that allow this race.

On the part you have, or on other STM32 parts, you can try the experiment hinted at above. Do the RCC enable with an STR, then on the very next instruction do an STR to MODER to change the pin to an output; after that you can take your time changing the state of the pin, say to turn an LED on. If it works when there are some number of clocks between the stores but does not work when they are back to back, there you go.
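One way to set that experiment up is a single source file with an assembly-time switch. This is a sketch, not something tested here: `BACK_TO_BACK` is a made-up symbol name, the vector table mirrors the flash.s in this answer, and PC1 is assumed to be the pin of interest.

```asm
    .syntax unified
    .cpu cortex-m4
    .thumb

    .global _start
_start:
    .word 0x20001000
    .word reset
    .word hang
    .word hang

    .thumb_func
hang:   b .

    .type reset, %function
reset:
    /* prepare everything first so the two stores can be adjacent */
    ldr r0, =0x40023830     /* RCC_AHB1ENR */
    ldr r2, =0x40020800     /* GPIOC MODER */
    ldr r3, =0x4            /* PC1 = output */
    ldr r1, [r0]
    orr r1, r1, #4          /* GPIOCEN, bit 2 */

    str r1, [r0]            /* enable the GPIOC clock */
    .ifndef BACK_TO_BACK
    ldr r4, [r2]            /* spacer: read MODER between the two stores */
    .endif
    str r3, [r2]            /* write MODER */

    b hang                  /* park here and inspect MODER with the debugger */
```

Assemble it once as is and once with `--defsym BACK_TO_BACK=1` added to the `arm-none-eabi-as` command line, then read 0x40020800 over the debug interface after each run. If the second build leaves MODER at 0 while the first shows 0x4, the part has the race.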

The great mystery is why the read works: STR, LDR, STR worked on the chips where I found a race condition. That does not make sense. The next great mystery is that if you read MODER registers that are non-zero, ones where you can get a good feel that you are actually reading that register (GPIOA and GPIOB MODER in this case), with the GPIO disabled in the RCC clock enable, the correct value comes back. This seems like a hack to me to get around the race condition. It could very well be that they had some number of chips with a library already out there being used, and then the next chip in design does not work; they are not going to force a new update of the HAL on everyone for this one part when they can make a quick fix in the design.

So I only tested a handful of different parts but

STR (the write that enables the clock)
LDR (of the moder)
STR (of the moder)

Worked for the parts where

STR (the write that enables the clock)
STR (of the moder)

Failed.

DMA, direct memory access.

So let us think about, say, a PWM controller or a UART that has a small buffer, maybe only one value being used for the current transaction/period and one value sitting in a transmit buffer waiting to be next. You as the programmer MIGHT, depending on the design, have a few choices. The first is a loop of code that polls status registers, waiting for an indication that the holding register value is now being transmitted and the holding register is now "empty". You then ideally want/need to write the new value before the prior value is completely transmitted, or before that time period of the PWM (or whatever) elapses.
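A polled transmit of that kind might look like this on the STM32F407's USART2. It is a sketch under assumptions: the USART2 clock, pins and baud rate are taken to be configured already (not shown), `uart_putc` and the constant names are made up, and the register addresses and the TXE bit position should be checked against the reference manual.

```asm
    .syntax unified
    .cpu cortex-m4
    .thumb

    .equ USART2_BASE,  0x40004400
    .equ USART_SR,     0x00         /* status register offset */
    .equ USART_DR,     0x04         /* data register offset */
    .equ USART_SR_TXE, (1 << 7)     /* transmit data register empty */

    .text
    .global uart_putc               /* hypothetical routine name */
    .type uart_putc, %function
/* r0 = byte to send; clobbers r1 and r2 */
uart_putc:
    ldr r1, =USART2_BASE
wait_txe:
    ldr r2, [r1, #USART_SR]
    tst r2, #USART_SR_TXE
    beq wait_txe                    /* spin until the holding register is empty */
    str r0, [r1, #USART_DR]         /* hand the next byte to the peripheral */
    bx  lr
    .size uart_putc, . - uart_putc
```

All the time spent in `wait_txe` is the processing time being burned while the peripheral is busy.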

You need to burn a lot of processing time if you want to keep the output of the peripheral at line rate, with no gaps or repeats or whatever that peripheral does. If your application is not doing anything else, and I will strongly argue if you are learning to use this peripheral, this is where you start. The next option might be an interrupt: you set up a handler, and when the interrupt happens you pull from a larger buffer you are maintaining (in RAM) to feed this peripheral. If you can ensure that the handler is fast enough and that no higher priority things delay the handler starting (real time), then this will work.

The third thing which is not always available is DMA. You in some way tell the peripheral or a third party dma controller, that here is the block of data I want to send, and here is where I want it sent, the logic has to be designed with a connection to trigger the dma, causing one or whatever number of items to move into the peripheral. There can still be contention that causes delays and a race condition, but this is your best bet if you want/need to feed something into a peripheral every cycle.

A similar concept comes up with memory transfers, where folks hear about DMA and think this one is free too, as it magically happens in the background and does not affect the processor. Say you set up a DMA engine that may be part of some system to transfer data from one place to another, rather than doing a memcpy or some such thing (often used for moving a frame of pixel data to a video card; these days the video cards do a lot of the work of generating pixels for you). On some systems this was free; on many others it was not, since the DMA engine uses the same buses, so the processor has to stall while the bus is in use, definitely affecting the processor.

I have seen some where the processor is stopped/stalled completely while the dma transfer is happening, and I saw one wasteful one where the bus was some three quarters thing, at all times there was a clock cycle every so many that was reserved for dma if you happened to want to do a transfer and that cycle/block of time would then get used. Basically you were always being affected by the dma even if it was not happening (that was a DSP where execution consistency is very important and clearly more important than having code go as fast as you can).

DMA is used for many things, the m is wrong it does not always mean memory, it might be from one fifo (okay technically an sram) to some peripheral. It generally means, despite the name, another bus controller that can initiate bus transactions so that you do not have to create those transactions through the main processor via code.

Honestly you are not ready for DMA yet, nor interrupts. Poll your way through and create lots of throwaway code, even if you are working toward a specific project. For each peripheral and each feature you are going after, create one or more ad-hoc applications and figure out how that thing works. If you get into a mystery situation where everything is working, and you add or remove even a single line of code that should have absolutely nothing to do with it and the thing breaks, then, if it is compiled code, question the output of the compiler and examine it.

If it is asm (or high level) it could be a race condition of some sort (it could be something else). They really are rare, often documented (eventually), BUT when they strike they can take a long time to figure out and sometimes you never are really sure but adding a delay fixed it, or reading the thing three times and taking the two that match, or whatever hack, and you move on with life. I was just warning you about something very real about a percentage of these STM32 designs. That could lead you to the same head scratching as what brought you here in the first place.
