Debugging data abort and prefetch abort exceptions on Windows CE 6

Question

We're running a rather complex piece of software on Windows CE and Windows Mobile for mobile data acquisition on different device types. On the only device type with Windows CE 6.0 installed, our client randomly freezes the operating system (so a warm-boot is required). The client may run fine for one or two days before freezing, but it can also freeze after five minutes (we have already checked for handle and memory leaks). In the device manufacturer's log file, entries like the following appear when the device freezes:

Exception 'Data Abort' (4): Thread-Id=070a003e(pth=89ca07e0), Proc-Id=0709003e(pprc=8a01d3d0) 'OurClient.exe', VM-active=0709003e(pprc=8a01d3d0) 'OurClient.exe' PC=41a66b28(mscoree3_5.dll+0x00056b28) RA=41a64ab4(mscoree3_5.dll+0x00054ab4) SP=0003e28c, BVA=00000132

The messages differ from time to time (I'd say I counted 20 different ones so far, with exceptions in kernel.dll, k.core.dll or nk.exe).

So my question is basically, how can I debug such an error occurring in the depths of the .NET framework and the kernel? For example, how can I translate the program counter into a method inside the mscorlib (same for the return address)? Is it likely that our program doesn't work well with CE 6 or could this be a driver issue as well?

Update: It turned out that one of the device drivers interferes with our keyboard hook implementation.

ctacke · Accepted Answer

As Alan points out, if you don't have the symbols and source for where things broke (and with mscoree3_5.dll, you don't) then the abort information is pretty useless. Even with the source, you can't walk it back without the compiler symbol output.

At this point you can only make educated guesses. The fact that the exception info all looks valid (i.e. the RA and SP are non-zero) indicates to me that it's not a stack issue; it's more likely a data issue (maybe an alignment problem, maybe a bad read or write pointer).
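With twenty-odd different abort messages, one thing you can still do without symbols is pull the module name and offset out of each log line and see whether the faults cluster anywhere. Below is a minimal sketch of that, assuming the log lines always have the `PC=...(module+0x...)` shape shown in the question. The `CodeAddress` and `AbortLineParser` names are hypothetical helpers, not part of any SDK.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical container for one code address taken from an abort line.
public sealed class CodeAddress
{
    public string Field;   // "PC" or "RA"
    public string Module;  // e.g. "mscoree3_5.dll"
    public uint Address;   // absolute address as logged
    public uint Offset;    // offset inside the module as logged

    // Load address of the module: absolute address minus module offset.
    public uint ModuleBase { get { return Address - Offset; } }
}

public static class AbortLineParser
{
    // Assumed format: PC=41a66b28(mscoree3_5.dll+0x00056b28)
    private static readonly Regex Pattern = new Regex(
        @"\b(PC|RA)=([0-9a-fA-F]{8})\(([^+()]+)\+0x([0-9a-fA-F]+)\)");

    private static uint Hex(string s)
    {
        return uint.Parse(s, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }

    // Returns the requested field ("PC" or "RA"), or null if it is not present.
    public static CodeAddress Find(string line, string field)
    {
        foreach (Match m in Pattern.Matches(line))
        {
            if (m.Groups[1].Value != field) continue;

            CodeAddress a = new CodeAddress();
            a.Field = field;
            a.Address = Hex(m.Groups[2].Value);
            a.Module = m.Groups[3].Value;
            a.Offset = Hex(m.Groups[4].Value);
            return a;
        }
        return null;
    }

    public static void Main()
    {
        string line = "PC=41a66b28(mscoree3_5.dll+0x00056b28) " +
                      "RA=41a64ab4(mscoree3_5.dll+0x00054ab4) SP=0003e28c, BVA=00000132";

        CodeAddress pc = Find(line, "PC");
        Console.WriteLine("{0}+0x{1:x} base=0x{2:x8}", pc.Module, pc.Offset, pc.ModuleBase);
    }
}
```

For the line in the question, both PC and RA work out to the same module base (0x41a10000), so both addresses sit in the same load of mscoree3_5.dll. That still doesn't give you a method name, but grouping the entries by module and offset at least tells you whether the faults keep hitting the same spot or are scattered.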

My guess is that it's from an incorrect P/Invoke. The fact that it "moves" indicates that it's likely an object reference or address passed to a P/Invoke going invalid due to collection or compaction.

Imagine the following scenario.

You have a native API that takes in a pointer to some data blob that said API will use not just immediately, but periodically. Maybe it reads from it or writes to it, but the key is that the API needs the data not just synchronously at the time of the call. The API necessarily stores that pointer for it to use at a later point.

You create some managed code that calls this API through a P/Invoke. To pass the data pointer you define a class that represents the data, create an instance of the class and pass it across. Let's say, for the sake of the example, that the address is 0x500.

You run your app, the API is called and all is well. The API reads from 0x500 and goes about its business.

Until the app triggers a GC. Now the GC says "hey, I have some empty space in the heap, I'll move some stuff around to fix that". It moves the managed object so that it's now at 0x200 and frees the memory at 0x500. At some point after that, the API goes to its pointer, still at 0x500, and does a read. The OS says "hey, that's unallocated space, you can't do that!" and it aborts.

The fix to this scenario is to use a pinned GCHandle. Instead of passing the class to the API, you pin the object and pass in the address you get from the GCHandle, which the GC cannot move during collection or compaction. This ensures that the address remains constant for the life of the GCHandle and is safe to pass across the native boundary.
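A rough sketch of the pinning pattern follows. The `SampleBlob` class is a made-up data blob, and `NativeRegister` / `NativeReadLater` are managed stand-ins for a native API that stores the pointer and reads from it later; in a real program they would be `[DllImport]` declarations taking an `IntPtr`. Only `GCHandle.Alloc`, `AddrOfPinnedObject`, `Free` and `Marshal.ReadInt32` are actual framework calls here.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical data blob shared with native code. Pinning requires a
// sequential layout and blittable fields only.
[StructLayout(LayoutKind.Sequential)]
public class SampleBlob
{
    public int Counter;
    public int Flags;
}

public static class PinnedBlobDemo
{
    // Stand-in for the pointer a native API would keep for later use.
    private static IntPtr storedPointer;

    // Stand-in for a native "register buffer" call (normally a P/Invoke).
    public static void NativeRegister(IntPtr blob)
    {
        storedPointer = blob;
    }

    // Stand-in for the native side reading the buffer at some later time.
    public static int NativeReadLater()
    {
        return Marshal.ReadInt32(storedPointer);
    }

    public static int Run()
    {
        SampleBlob blob = new SampleBlob();
        blob.Counter = 42;

        // Pin the object so the GC cannot relocate it while native code
        // still holds its address.
        GCHandle handle = GCHandle.Alloc(blob, GCHandleType.Pinned);
        try
        {
            NativeRegister(handle.AddrOfPinnedObject());

            // A collection may compact the heap, but the pinned object stays put.
            GC.Collect();

            // Reads the first field (Counter) through the stored address.
            return NativeReadLater();
        }
        finally
        {
            // Free only once the native side no longer uses the pointer.
            handle.Free();
        }
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

In the scenario above, where the API keeps using the pointer periodically, the handle would have to live in a field for as long as the native side holds the address, and be freed only after the buffer has been unregistered. Freeing it at the end of the calling method, as this short demo does, would just bring the original problem back.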

Notice that this scenario happens without using unsafe code at all, though you could do the same with unsafe code. In fact I'd argue that with unsafe code you'd likely be more cognizant of where it might happen, and that code would be "safer" than the code not marked as unsafe. Avoiding the unsafe keyword doesn't prevent unsafe code.
