teekarna

Reputation: 1034

Debugging parallel Python programs (mpi4py)

I have an mpi4py program that hangs intermittently. How can I trace what the individual processes are doing?

I can run the program in different terminals, for example using pdb

mpiexec -n 4 xterm -e "python -m pdb my_program.py"

But this gets cumbersome if the issue only manifests with a large number of processes (~80 in my case). In addition, pdb makes it easy to catch exceptions, but to figure out where a hang occurs I would need to see the execution trace of each process.

Upvotes: 8

Views: 2661

Answers (1)

teekarna

Reputation: 1034

The Python trace module allows you to trace program execution. In order to store the trace of each process separately, you need to wrap your code in a function:

def my_program(*args, **kwargs):
    # insert your code here
    pass

And then run it with trace.Trace.runfunc:

import sys
import trace

from mpi4py import MPI

# define Trace object: trace line numbers at runtime, exclude some modules
tracer = trace.Trace(
    ignoredirs=[sys.prefix, sys.exec_prefix],
    ignoremods=[
        'inspect', 'contextlib', '_bootstrap',
        '_weakrefset', 'abc', 'posixpath', 'genericpath', 'textwrap'
    ],
    trace=1,
    count=0)

# by default the trace goes to stdout;
# redirect it to a separate file for each process, named by MPI rank
# (line-buffered, so the last traced lines reach the disk even if the process hangs)
sys.stdout = open('trace_{:04d}.txt'.format(MPI.COMM_WORLD.rank), 'w', buffering=1)

tracer.runfunc(my_program)

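Now each process writes its trace to its own file, named by rank: trace_0000.txt, trace_0001.txt, etc. Because sys.stdout is redirected, the program's own print output ends up in these files as well. Use the ignoredirs and ignoremods arguments to omit low-level calls.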
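To see where a hang occurs, it is usually enough to look at the end of each trace file, since the last traced line shows where that process stopped. Here is a minimal sketch for doing that with ~80 files, assuming the trace_*.txt naming used above. The helper name last_trace_lines and its parameters are my own and are not part of the trace module:

```python
import glob
from collections import deque


def last_trace_lines(pattern="trace_*.txt", n=3):
    """Return {path: last n non-empty lines} for every file matching pattern.

    Hypothetical helper: the default pattern assumes one trace file per rank.
    """
    tails = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as fh:
            # a bounded deque keeps only the final n lines of a large file
            non_empty = (line.rstrip("\n") for line in fh if line.strip())
            tails[path] = list(deque(non_empty, maxlen=n))
    return tails


if __name__ == "__main__":
    # print the tail of each per-rank trace file
    for path, lines in last_trace_lines().items():
        print(path)
        for line in lines:
            print("    " + line)
```

Ranks that stop on different lines than the rest (for example, one rank still computing while the others wait in a collective call) are usually the place to start looking.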

Upvotes: 3
