Reputation: 12532

How to debug a watchdog timeout

I have a watchdog in my microcontroller that resets the processor if it is not kicked. My application runs fine for a while but eventually resets because the watchdog did not get kicked. If I step through the program, it works fine.

What are some ways to debug this?

EDIT: Conclusion: The way I found my bug was the watchdog breadcrumbs.

I am using a PIC that has a high and a low ISR vector. The high vector was supposed to handle the LED matrix and the low vector was supposed to handle the timer tick, but I put both ISR handlers in the high vector. So when I disabled the LED matrix interrupt and the timer tick needed service, the processor would get stuck in the low ISR vector trying to handle the timer tick, because the timer tick handler was not there.

The breadcrumbs narrowed my search down to the function that handled the LED matrix, and specifically to the point where the LED matrix interrupt was disabled.

Upvotes: 12

Answers (10)

mvds

Reputation: 47074

Expanding on the excellent accepted answer, for my application I took the approach of examining RAM after a watchdog reset a bit further. As this text is way too long for a comment, I'm adding this as an additional answer:

Our application is preceded by a custom bootloader that provides a few functions such as OTA firmware update. By setting the initial stack pointer for this bootloader to 1 KB before the end of RAM, enough of the stack remains in RAM to reconstruct a backtrace in case the main firmware is reset by the watchdog. The bootloader then needs to identify the watchdog as the reboot cause and copy the last KB of RAM to some designated flash area, from which it can be recovered.

Reconstructing the backtrace is a bit cumbersome, as there is no PC to start from, so in practice the stack should be (manually) unwound from the bottom to the top, and some educated guessing may be needed to locate the point at which the application stalled. (This is not necessarily the point at which stale data is seen and unwinding fails!)

This approach helped me to systematically pinpoint an issue that occurred only very sporadically. To log additional 'breadcrumbs' on the stack, simply insert things like:

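// Local array on the stack, so it shows up in the saved RAM dump;
// volatile keeps the compiler from optimizing the stores away.
__attribute__((unused)) volatile uint32_t _state[4];
_state[0] = 0x57a11ed; // magic value to aid manual unwinding
_state[1] = RCC->CSR;  // RCC control/status register at this point
_state[2] = count++; // maybe we're in a runaway loop?
// etc.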
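To find these records again in the saved dump, a small host-side helper can scan for the magic value. This is only a sketch under assumptions: the dump has been loaded as an array of 32-bit words in target byte order, and `find_state_record` and `STATE_WORDS` are names I made up for illustration. Keep in mind that a hit may be a stale copy left by an earlier call, as noted above.

```c
#include <stddef.h>
#include <stdint.h>

#define STATE_MAGIC 0x57a11edu /* marker stored in _state[0] */
#define STATE_WORDS 4u         /* length of the _state[] array */

/*
 * Return the index of the first word at or after `start` that holds the
 * marker and still has room for a full record behind it, or -1 if there
 * is no such word. Hypothetical helper, not part of any library.
 */
long find_state_record(const uint32_t *dump, size_t n_words, size_t start)
{
    for (size_t i = start; i + STATE_WORDS <= n_words; i++) {
        if (dump[i] == STATE_MAGIC) {
            return (long)i;
        }
    }
    return -1;
}
```

With a hit at index `i`, `dump[i + 1]` and `dump[i + 2]` correspond to `_state[1]` and `_state[2]`. Calling the function again with `start = i + 1` walks through any further records.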

In applications without a separate bootloader, the initial stack pointer could be set to 1 KB before the end of RAM, only to be changed to the end of RAM after a regular boot (which of course is non-trivial!). Then, in case of a watchdog reset, the application may simply store or transmit the last KB of RAM for offline analysis.
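For the "identify the reboot cause" step, something along these lines may work. It is a sketch, assuming an STM32F4-style layout of `RCC->CSR` where bit 29 flags an independent watchdog reset and bit 30 a window watchdog reset. Other families place these flags differently, so check the reference manual for your part. `reset_was_watchdog` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed STM32F4-style reset flags in RCC_CSR; verify for your device. */
#define CSR_IWDGRSTF (1u << 29) /* independent watchdog reset flag */
#define CSR_WWDGRSTF (1u << 30) /* window watchdog reset flag */

/* True if the given RCC_CSR value says the last reset came from a watchdog. */
bool reset_was_watchdog(uint32_t csr)
{
    return (csr & (CSR_IWDGRSTF | CSR_WWDGRSTF)) != 0u;
}
```

On the target you would call this with the value of `RCC->CSR` early in boot, before anything clears the reset flags, and only save the RAM dump when it returns true.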

Upvotes: 0

BruzsaPeti

Reputation: 41

Place a breakpoint in the watchdog ISR; when execution stops there, you can inspect the call stack. This shows what was executing when the watchdog fired, along with any other conditions: state variables, registers, buffers, etc.

Upvotes: 0

U. Windl

Reputation: 4325

You could attach strace (option -p) to your running process, watching when it stops writing to the file descriptor that opened /dev/watchdog. You can filter strace output using option -e. See the manual page for details.

Upvotes: 0

shivakeerthan

Reputation: 11

You can insert a while loop in your code and toggle an LED inside it. This is an effective way to check whether the board is resetting.

Upvotes: 1

Michael Kohne

Reputation: 12044

I'd use an extra output pin, set high then low at appropriate points in the code to limit the scope of where I'm looking. Then I'd trace it on a digital scope or logic analyzer. This is equivalent to the breadcrumbs method mentioned by another poster, but you'll be able to time correlate to the reset pulse much better.

Upvotes: 2

Mikeage

Reputation: 6564

Many software watchdogs are automatically disabled when you attach a debugger (to prevent it from restarting while the debugger has the application halted).

That said, here are some basics:

Is this a multithreaded application? Are you using an RT scheduler? If so, is your watchdog task starved?

Make sure your watchdog task can't be stuck on anything (pending semaphore, waiting for a message, etc). Sometimes, functions can block in ways you might not expect; for example, I have a Linux platform I'm working on right now where I can get printf to block quite easily.

If it's single threaded, a profiler may help you identify timing issues.

If this is a new system, make sure the watchdog works correctly; test simple code that just hits the WD and then sleeps in an infinite loop.

Upvotes: 4

billmcc

Reputation: 711

Usually the watchdog task/thread runs at a low priority. So if the watchdog isn't getting kicked, this should be because the processor is busy doing something else - probably something that it shouldn't be doing.

It would be really useful to dump out the execution context (local stack, scheduling state etc.) for each task/thread just before the processor resets. With a bit of luck and work, you'll be able to determine what is preventing the watchdog task from kicking the timer.

Upvotes: 2

Stephen Friederichs

Reputation: 1059

I use state-based programming, and a trick I've always wanted to employ is to reserve one output port for the current state in binary, then hook up a logic analyzer and watch the timing of the state changes. You could do something similar here: do what Robert said and create a global variable whose value changes at key points, preferably through a function that immediately writes the current state to the port (i.e. changeState(nextState);). Change the state when you enter the function that kicks the dog, then change it back to the previous state before you leave the function. You should be able to see which functions it DOESN'T get kicked from, and then you can work on those.
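A rough sketch of what I mean, with everything except the changeState idea made up for illustration: `DEBUG_PORT` stands in for a real memory-mapped output port (here it is just a variable so the code builds anywhere), and the state names and `hw_kick_watchdog` are placeholders.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped output port wired to the logic analyzer. */
volatile uint8_t DEBUG_PORT;

/* Hypothetical application states, one binary code per state. */
typedef enum {
    STATE_IDLE = 0,
    STATE_LED_REFRESH = 1,
    STATE_TIMER_TICK = 2,
    STATE_KICK_DOG = 3
} app_state_t;

static app_state_t current_state = STATE_IDLE;

/* Record the new state, mirror it on the port, return the previous state. */
app_state_t changeState(app_state_t next)
{
    app_state_t prev = current_state;
    current_state = next;
    DEBUG_PORT = (uint8_t)next;
    return prev;
}

/* Placeholder for the hardware-specific watchdog kick. */
static void hw_kick_watchdog(void) {}

/* Mark the kick on the port, then restore whatever state we came from. */
void kick_dog(void)
{
    app_state_t prev = changeState(STATE_KICK_DOG);
    hw_kick_watchdog();
    changeState(prev);
}
```

On the analyzer, every kick then shows up as a short STATE_KICK_DOG pulse inside the state that called it, so a long stretch of one state with no pulse points at the code that is starving the dog.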

Good luck, it sounds like a timing problem and those are tough to solve.

Upvotes: 2

Robert Deml

Reputation: 12532

Add an uninitialized global variable that is set to different values throughout the code. Specifically, set it before and after major function calls.

Put a breakpoint at the beginning of main.

When the processor resets, the breakpoint is hit and the global variable still holds the last value it was set to. Keep adding these "bread crumbs" to narrow the search down to the problem function.
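As a minimal sketch of the idea (the crumb names and the two worker functions are placeholders, not anyone's real code): on a real target the variable must sit in RAM that the startup code does not zero or initialize, which usually means a compiler- or linker-specific "noinit" section. Here it is a plain global so the example builds on a PC.

```c
#include <stdint.h>

/* Hypothetical breadcrumb IDs, one per checkpoint. */
enum breadcrumb {
    CRUMB_NONE = 0,
    CRUMB_BEFORE_LED_UPDATE,
    CRUMB_AFTER_LED_UPDATE,
    CRUMB_BEFORE_TIMER_WORK,
    CRUMB_AFTER_TIMER_WORK
};

/* Must survive a reset on the target: keep it out of zero-initialized RAM. */
volatile uint8_t g_breadcrumb;

static void drop_crumb(enum breadcrumb id)
{
    g_breadcrumb = (uint8_t)id;
}

/* Placeholders for the real application functions. */
static void update_led_matrix(void) {}
static void do_timer_work(void) {}

void main_loop_iteration(void)
{
    drop_crumb(CRUMB_BEFORE_LED_UPDATE);
    update_led_matrix();
    drop_crumb(CRUMB_AFTER_LED_UPDATE);

    drop_crumb(CRUMB_BEFORE_TIMER_WORK);
    do_timer_work();
    drop_crumb(CRUMB_AFTER_TIMER_WORK);
}
```

If the breakpoint at the start of main shows CRUMB_BEFORE_LED_UPDATE after a reset, the hang is somewhere inside update_led_matrix, and you add finer-grained crumbs there.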

Upvotes: 12

xtofl

Reputation: 41509

Question every assumption you make, twice:

Make sure the watchdog is kicked (I don't know the logging facilities on the processor).
Make sure the watchdog, when kicked, doesn't reset the processor.

And wonder what differences there are between 'stepping through' and running alone; timing constraints will surely matter.

Upvotes: 0
