Reputation: 77
The OpenMDAO problem that I'm running is quite complicated, so I don't think it would be helpful to post the entire script. However, the basic setup is that my problem root is a ParallelFDGroup (not actually finite differencing for now--just running the problem once) that contains a few normal components as well as a parallel group. The parallel group is responsible for running 56 instances of an external code (one component per instance of the code); a simplified sketch of this layout is included below, after the tracebacks. Strangely, when I run the problem with 4-8 processors, everything seems to work fine (it sometimes even works with 10-12 processors). But when I try to use more processors (20+), I fairly consistently get the errors below. The output contains two tracebacks:
Traceback (most recent call last):
File "opt_5mw.py", line 216, in <module>
top.setup() #call setup
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/problem.py", line 644, in setup
self.root._setup_vectors(param_owners, impl=self._impl, alloc_derivs=alloc_derivs)
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/group.py", line 476, in _setup_vectors
self._u_size_lists = self.unknowns._get_flattened_sizes()
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/petsc_impl.py", line 204, in _get_flattened_sizes
return self.comm.allgather(sizes)
File "MPI/Comm.pyx", line 1291, in mpi4py.MPI.Comm.allgather (src/mpi4py.MPI.c:109194)
File "MPI/msgpickle.pxi", line 746, in mpi4py.MPI.PyMPI_allgather (src/mpi4py.MPI.c:48575)
mpi4py.MPI.Exception: MPI_ERR_IN_STATUS: error code in status
Traceback (most recent call last):
File "opt_5mw.py", line 216, in <module>
top.setup() #call setup
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/problem.py", line 644, in setup
self.root._setup_vectors(param_owners, impl=self._impl, alloc_derivs=alloc_derivs)
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/group.py", line 476, in _setup_vectors
self._u_size_lists = self.unknowns._get_flattened_sizes()
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/petsc_impl.py", line 204, in _get_flattened_sizes
return self.comm.allgather(sizes)
File "MPI/Comm.pyx", line 1291, in mpi4py.MPI.Comm.allgather (src/mpi4py.MPI.c:109194)
File "MPI/msgpickle.pxi", line 749, in mpi4py.MPI.PyMPI_allgather (src/mpi4py.MPI.c:48609)
File "MPI/msgpickle.pxi", line 191, in mpi4py.MPI.Pickle.loadv (src/mpi4py.MPI.c:41957)
File "MPI/msgpickle.pxi", line 143, in mpi4py.MPI.Pickle.load (src/mpi4py.MPI.c:41248)
cPickle.BadPickleGet: 65
I am running under Ubuntu with OpenMDAO 1.7.3. I have tried running with both mpirun.openmpi (OpenRTE) 1.4.3 and mpirun (Open MPI) 1.4.3 and have gotten the same result in each case.
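For reference, the overall layout of the script looks roughly like the sketch below. The names, counts, and component contents are simplified placeholders, and the ParallelFDGroup usage reflects my understanding of the 1.7.x API rather than a copy of my actual code:

# Simplified sketch of the problem layout (not my actual script).
# Names, counts, and sizes are placeholders; OpenMDAO 1.7.x API assumed.
from openmdao.api import Problem, Component, ParallelGroup, ParallelFDGroup
from openmdao.core.petsc_impl import PetscImpl as impl  # PETSc data passing under MPI


class ExtCodeComp(Component):
    """Stand-in for one instance of the external code."""
    def __init__(self):
        super(ExtCodeComp, self).__init__()
        self.add_param('x', val=0.0)
        # a few dozen large array outputs, as described above
        for i in range(30):
            self.add_output('out{0}'.format(i), shape=[50000])

    def solve_nonlinear(self, params, unknowns, resids):
        pass  # the real component launches the external code here


top = Problem(impl=impl)
top.root = ParallelFDGroup(1)  # argument = number of parallel FD points (assumed); not FD'ing yet
# (the few "normal" components in the root are omitted here)
par = top.root.add('par_group', ParallelGroup())
for i in range(56):
    par.add('case{0}'.format(i), ExtCodeComp())

top.setup()  # this is where the allgather failure occurs
top.run()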
I found this post that seems to suggest that there is something wrong with the MPI installation. But if that were the case, it strikes me as strange that the problem works with a small number of processors but not with a larger number. I can also run a relatively simple OpenMDAO problem (no external codes) on 32 processors without incident.
Because the traceback references OpenMDAO unknowns, I wondered if there are limitations on the size of OpenMDAO unknowns. In my case, each external code component has a few dozen array outputs that can be up to 50,000-60,000 elements each. Might that be problematic? Each external code component also reads the same set of input files. Could that be an issue as well? I have tried to ensure that read and write access is defined properly but perhaps that's not enough.
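For scale, here is a rough back-of-envelope estimate of how much unknown data that implies (taking "a few dozen" as 30 outputs of 55,000 elements each--both numbers assumed):

# Rough size estimate of the unknowns (numbers assumed, not measured)
n_outputs = 30          # "a few dozen" array outputs per component
n_elements = 55000      # 50,000-60,000 elements each
n_components = 56       # instances of the external code
bytes_per_float = 8     # double precision

per_comp_mb = n_outputs * n_elements * bytes_per_float / 1.0e6
total_gb = per_comp_mb * n_components / 1.0e3
print('~%.0f MB of unknowns per component, ~%.1f GB across the parallel group'
      % (per_comp_mb, total_gb))
# prints: ~13 MB of unknowns per component, ~0.7 GB across the parallel group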
Any suggestions about what might be the culprit in this situation are appreciated.
EDIT: I should add that I have tried running the problem without actually running the external codes (i.e. the components in the parallel group are called and set up but the external subprocesses are never actually created) and the problem persists.
EDIT2: I have done some more debugging on this issue and thought I should share the little that I have discovered. If I strip the problem down to only the parallel group containing the external code instances, the problem persists. However, if I reduce the components in the parallel group to basically nothing--just a print function for setup and for solve_nonlinear--then the problem can successfully "run" with a large number of processors. I started adding setup lines back in one by one to see what would create problems. I ran into issues when trying to add many large unknowns to the components. I can actually still add just a single large unknown--for example, this works:
self.add_output('BigOutput', shape=[100000])
But when I try to add too many large outputs like below, I get errors:
for i in range(100):
    outputname = 'BigOutput{0}'.format(i)
    self.add_output(outputname, shape=[100000])
Sometimes I just get a general segmentation violation error from PETSc. Other times I get a fairly lengthy traceback that is too long to post here--I'll post just the beginning in case it provides any helpful clues:
*** glibc detected *** python2.7: free(): invalid pointer: 0x00007f21204f5010 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7da26)[0x7f2285f0ca26]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/../../libsqlite3.so.0(sqlite3_free+0x4f)[0x7f2269b7754f]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/../../libsqlite3.so.0(+0x1cbbc)[0x7f2269b87bbc]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/../../libsqlite3.so.0(+0x54d6c)[0x7f2269bbfd6c]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/../../libsqlite3.so.0(+0x9d31f)[0x7f2269c0831f]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/../../libsqlite3.so.0(sqlite3_step+0x1bf)[0x7f2269be261f]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/_sqlite3.so(pysqlite_step+0x2d)[0x7f2269e4306d]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/_sqlite3.so(_pysqlite_query_execute+0x661)[0x7f2269e404b1]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8942)[0x7f2286c6a5a2]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x86c3)[0x7f2286c6a323]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x86c3)[0x7f2286c6a323]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f2286c6b1ce]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(+0x797e1)[0x7f2286be67e1]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x53)[0x7f2286bb6dc3]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(+0x5c54f)[0x7f2286bc954f]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x53)[0x7f2286bb6dc3]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x43)[0x7f2286c60d63]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(+0x136652)[0x7f2286ca3652]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f2286957e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f2285f8236d]
======= Memory map: ========
00400000-00401000 r-xp 00000000 08:03 9706352 /home/austinherrema/miniconda2/bin/python2.7
00600000-00601000 rw-p 00000000 08:03 9706352 /home/austinherrema/miniconda2/bin/python2.7
00aca000-113891000 rw-p 00000000 00:00 0 [heap]
7f21107d6000-7f2241957000 rw-p 00000000 00:00 0
etc...
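To make that stripped-down test concrete, the failing case boils down to roughly the script below (condensed; details assumed), run with something like mpirun -np 20 python repro.py or more processes:

# Condensed version of the stripped-down test described above (details assumed)
from openmdao.api import Problem, Component, Group, ParallelGroup
from openmdao.core.petsc_impl import PetscImpl as impl


class BigOutputComp(Component):
    """Does basically nothing except declare many large outputs."""
    def __init__(self):
        super(BigOutputComp, self).__init__()
        for i in range(100):
            self.add_output('BigOutput{0}'.format(i), shape=[100000])

    def solve_nonlinear(self, params, unknowns, resids):
        print(self.pathname)  # just a print, no external code


top = Problem(impl=impl)
top.root = Group()
par = top.root.add('par_group', ParallelGroup())
for i in range(56):
    par.add('case{0}'.format(i), BigOutputComp())

top.setup()  # fails with 20+ processors, works with a handful
top.run()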
Upvotes: 0
Views: 331
Reputation: 5710
It's hard to guess what's going on here, but if it works for a small number of processors and not for larger ones, one guess is that the issue only shows up once you use more than one node and data has to get transferred across the network. I have seen bad MPI compilations that behaved this way: things would work as long as I kept the job to one node, but broke on more than one.
The traceback shows that you're not even getting through setup, so it's not likely to be anything in your external code or in any other component's run method.
If you're running on a cluster, are you compiling your own MPI? You usually need to compile it with very specific options/libraries for any kind of HPC library. But most HPC systems provide modules you can load that already have MPI pre-compiled.
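One way to separate OpenMDAO from the MPI install is to run a bare mpi4py test that does the same kind of allgather, with the same processor count and node layout that fails for you. Something along these lines (just a sketch); if this also dies at 20+ processes across multiple nodes, the problem is the MPI build, not your model:

# mpi_smoke_test.py -- bare mpi4py check, no OpenMDAO involved.
# Run with the same command and processor count that fails, e.g.
#   mpirun -np 32 python mpi_smoke_test.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# pickle-based allgather of metadata, similar in spirit to _get_flattened_sizes
sizes = {'rank': rank, 'sizes': [50000] * 30}
gathered = comm.allgather(sizes)

# also push some real array data across the network
local = np.full(50000, float(rank))
total = np.zeros_like(local)
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    expected = sum(range(comm.size))
    print('allgather returned %d entries; Allreduce ok: %s'
          % (len(gathered), np.allclose(total, expected)))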
Upvotes: 0