Reputation: 51
I'm having an issue with the code below. With all optimizations off (-O0) the code behaves as I would expect; however, when I turn optimization on it does not. So I'm posting here to see if anyone can spot a reason why the code would operate differently (incorrectly) when optimizations are on. Let me explain what the code is doing...
TPM0->CNT is a timer register of type volatile uint32_t. When TPM0->CNT reaches its maximum value and rolls over, an interrupt is generated. Within this interrupt, sg_CpuCount is incremented by the number of timer cycles that have elapsed since the last interrupt.
The GetCpuCount function's job is to return the instantaneous count (i.e. sg_CpuCount + TPM0->CNT). However, because interrupts can happen at any moment and the timer does not stop, care must be taken when reading and combining the two parts. Both components are read, and then the first component is read again to see whether it changed. If it changed, an interrupt must have fired between the first and second readings and the process is repeated; if the two readings match, no interrupt occurred and the sum of the components is returned.
Is the logic sound? Is optimization generating bad code? Any insights? Thanks in advance!
edit: I should mention the behavior I notice when optimizations are on. Subsequent calls to GetCpuCount are not always monotonically increasing, as they are when optimizations are off. That is, a later call to GetCpuCount can return a smaller number than an earlier call, which should not be possible.
edit2: I should mention that the micro this code runs on is a 32-bit architecture (and I'm using 64-bit values). I still think it should be OK, but it is a point to think about.
static volatile int64_t sg_CpuCount;

void TPM0_IRQHandler(void)
{
    TPM0->SC |= TPM_SC_TOF(1); // clear interrupt flag
    sg_CpuCount += CPU_CLOCK / TICKRATE_HZ;
}

int64_t GetCpuCount(void)
{
    for (;;) {
        int64_t tmpCpuCount = sg_CpuCount;
        uint32_t tmpCnt = TPM0->CNT;
        if (tmpCpuCount == sg_CpuCount) {
            return tmpCpuCount + tmpCnt;
        }
    }
}
edit3: adding the generated assembly
optimized (-Os):
GetCpuCount:
.LFB36:
.loc 2 80 0
.cfi_startproc
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
push {r4, r5, r6, lr}
.cfi_def_cfa_offset 16
.cfi_offset 4, -16
.cfi_offset 5, -12
.cfi_offset 6, -8
.cfi_offset 14, -4
.LBB2:
.loc 2 82 0
ldr r4, .L20
.loc 2 83 0
ldr r6, .L20+4
.L18:
//int64_t tmpCpuCount = sg_CpuCount;
ldr r0, [r4]
ldr r1, [r4, #4]
//uint32_t tmpCnt = TPM0->CNT;
ldr r5, [r6, #4]
//if (tmpCpuCount == sg_CpuCount) {
ldr r2, [r4]
ldr r3, [r4, #4]
cmp r0, r2
bne .L18
cmp r1, r3
bne .L18
//return tmpCpuCount + tmpCnt;
movs r0, r5
movs r1, #0
.LBE2:
.loc 2 88 0
@ sp needed
.LBB3:
.loc 2 85 0
adds r0, r0, r2
adcs r1, r1, r3
.LBE3:
.loc 2 88 0
pop {r4, r5, r6, pc}
.L21:
.align 2
.L20:
.word sg_CpuCount
.word 1073971200
.cfi_endproc
non-optimized (-O0):
GetCpuCount:
.LFB36:
.loc 2 80 0
.cfi_startproc
@ args = 0, pretend = 0, frame = 16
@ frame_needed = 1, uses_anonymous_args = 0
push {r4, r7, lr}
.cfi_def_cfa_offset 12
.cfi_offset 4, -12
.cfi_offset 7, -8
.cfi_offset 14, -4
sub sp, sp, #20
.cfi_def_cfa_offset 32
add r7, sp, #0
.cfi_def_cfa_register 7
.L19:
.LBB2:
//int64_t tmpCpuCount = sg_CpuCount;
ldr r3, .L21
ldr r4, [r3, #4]
ldr r3, [r3]
str r3, [r7, #8]
str r4, [r7, #12]
//uint32_t tmpCnt = TPM0->CNT;
ldr r3, .L21+4
ldr r3, [r3, #4]
str r3, [r7, #4]
//if (tmpCpuCount == sg_CpuCount) {
ldr r3, .L21
ldr r4, [r3, #4]
ldr r3, [r3]
ldr r0, [r7, #8]
cmp r0, r3
bne .L19
ldr r0, [r7, #12]
cmp r0, r4
bne .L19
//return tmpCpuCount + tmpCnt;
ldr r3, [r7, #4]
movs r1, r3
movs r3, #0
movs r2, r3
ldr r3, [r7, #8]
ldr r4, [r7, #12]
adds r3, r3, r1
adcs r4, r4, r2
.LBE2:
.loc 2 88 0
movs r0, r3
movs r1, r4
mov sp, r7
add sp, sp, #20
@ sp needed
pop {r4, r7, pc}
.L22:
.align 2
.L21:
.word sg_CpuCount
.word 1073971200
.cfi_endproc
Upvotes: 2
Views: 161
Reputation: 51
After analyzing the assembly for the optimized and non-optimized code I noticed one main difference: one loads the 64-bit value high word first and the other loads it low word first. So I believe user5329483 is correct, and I'll explain the exact sequence that can cause the problem...
read sg_CpuCount low
read sg_CpuCount high
read TPM0->CNT (the counter has already rolled over to a small value; the overflow interrupt is still pending)
read sg_CpuCount low
***interrupt***
read sg_CpuCount high
compare first sg_CpuCount with second sg_CpuCount (equal!!!!!)
Since the increment of sg_CpuCount in the interrupt may only change the lower word, the sequence above fails to detect that the interrupt happened: the stale sg_CpuCount is added to the post-rollover CNT, producing a result that is smaller than an earlier call's result.
The non-optimized code does the following:
read sg_CpuCount high
read sg_CpuCount low
read TPM0->CNT
read sg_CpuCount high
***interrupt***
read sg_CpuCount low
compare first sg_CpuCount with second sg_CpuCount (**NOT** equal!!!!!)
Because the low word is re-read after the interrupt, this code correctly detects that the interrupt happened and retries.
So the optimization did in fact change the code; however, both code generations are equally valid from the compiler's perspective, since it may load the two 32-bit halves of a 64-bit value in either order. The answer is either to define the order of operations explicitly by splitting the 64-bit int into two 32-bit variables, or better yet to just use user5329483's suggested code.
edit: unfortunately 32 bits is not enough for the run times I need. Anyway, the problem is well understood now.
Upvotes: 1
Reputation: 1272
If you are on a 32-bit machine you cannot guarantee that 64-bit operations are atomic. Due to optimizations, timings may shift. And never, ever call the function with interrupts disabled.
static volatile uint32_t sg_CpuCount;

void TPM0_IRQHandler(void)
{
    TPM0->SC |= TPM_SC_TOF(1); // clear interrupt flag
    sg_CpuCount++; // count interrupts
}

int64_t GetCpuCount(void)
{
    for (;;) {
        uint32_t tmpCpuCount = sg_CpuCount;
        uint32_t tmpCnt = TPM0->CNT;
        if (tmpCpuCount == sg_CpuCount) {
            return (int64_t)tmpCpuCount * CPU_CLOCK / TICKRATE_HZ + tmpCnt;
        }
    }
}
Upvotes: 1