Reputation: 15100
I have a nasty Heisenbug in an application. In general terms, it's a parallel Fortran program that spawns a parallel C++ program with MPI-2's MPI_Comm_spawn functionality. At some point a buffer appears to be overrun somewhere: unrelated variables end up with strange (i.e. shifted) values or become uninitialized the second or third time they are used. For example, my counter in a DO loop loses its value between iterations, in a part of the code totally unrelated to the data from the coupling.
Valgrind reports nothing. Electric Fence reports nothing. mtrace() shows nothing. Both the GNU and Intel compiler suites show the same problems, but neither can catch why or where. Optimized and debug builds show different problems. Both MPICH and Open MPI show the same problems. gdb, idb and Intel Inspector don't catch anything. Adding print statements makes the crash change location, but it still happens.
Every unit test and validation test passes on each program independently. It's the interaction between them that seems to be the problem, but no tool I have used gives me any indication of why or where.
I am at a total loss. What the heck do you do when every tool and trick you know fails? Are there any other tools out there that I may have missed? I'm about to nuke it all and start over again, hoping I don't make whatever mistake this is a second time.
Upvotes: 0
Views: 78
Reputation: 96139
The only thing you can do is start from a minimum working set and add until it breaks. Or sometimes, if you are really lucky, you get to the final required result by a slightly different path.
Alternatively, you could turn to drink.
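For what it's worth, the "start from a minimum working set and add until it breaks" idea can be sketched roughly like this. Everything here is made up for illustration: `Step`, `first_breaking_step` and `demo_steps` are hypothetical names, the "state" is just a small vector, and the toy pipeline clobbers a value on purpose. In the real application each step would be a piece of the Fortran/C++ coupling and the check would be whatever invariant keeps getting trashed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one piece of the coupled application
// (e.g. "spawn the child", "exchange the first buffer").
struct Step {
    std::string name;
    std::function<void(std::vector<int>&)> apply;
};

// Rebuild the run from scratch with the first 1, 2, 3, ... steps enabled and
// return the index of the step whose addition first breaks the invariant,
// or -1 if every prefix passes.
int first_breaking_step(
    const std::vector<Step>& steps,
    const std::function<bool(const std::vector<int>&)>& still_sane) {
    assert(still_sane && "an invariant check is required");
    for (std::size_t enabled = 1; enabled <= steps.size(); ++enabled) {
        std::vector<int> state(4, 0);  // fresh state for every trial
        for (std::size_t i = 0; i < enabled; ++i) {
            steps[i].apply(state);
        }
        if (!still_sane(state)) {
            return static_cast<int>(enabled - 1);
        }
    }
    return -1;
}

// Toy pipeline: the third step deliberately writes to a slot it does not own.
std::vector<Step> demo_steps() {
    return {
        {"init",     [](std::vector<int>& s) { s[0] = 1; }},
        {"exchange", [](std::vector<int>& s) { s[1] = s[0] + 1; }},
        {"couple",   [](std::vector<int>& s) { s[3] = 99; }},  // the culprit
        {"finish",   [](std::vector<int>& s) { s[2] = s[1]; }},
    };
}
```

Calling `first_breaking_step(demo_steps(), check)` with a check that `s[3]` is still zero points at the "couple" step. The point is only that every trial starts from a clean state and adds exactly one thing, so the first failure names the piece that was just added.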
Upvotes: 1