Reputation: 77
The OpenMDAO problem that I'm running is quite complicated, so I don't think it would be helpful to post the entire script. However, the basic setup is that my problem root is a ParallelFDGroup (not actually finite differencing for now--just running the problem once) that contains a few normal components as well as a parallel group. The parallel group is responsible for running 56 instances of an external code (one component per instance of the code); a simplified sketch of this layout is included below, after the tracebacks. Strangely, when I run the problem with 4-8 processors, everything seems to work fine (it sometimes even works with 10-12 processors). But when I try to use more processors (20+), I fairly consistently get the errors below. The output contains two tracebacks:
Traceback (most recent call last):
File "opt_5mw.py", line 216, in <module>
top.setup() #call setup
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/problem.py", line 644, in setup
self.root._setup_vectors(param_owners, impl=self._impl, alloc_derivs=alloc_derivs)
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/group.py", line 476, in _setup_vectors
self._u_size_lists = self.unknowns._get_flattened_sizes()
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/petsc_impl.py", line 204, in _get_flattened_sizes
return self.comm.allgather(sizes)
File "MPI/Comm.pyx", line 1291, in mpi4py.MPI.Comm.allgather (src/mpi4py.MPI.c:109194)
File "MPI/msgpickle.pxi", line 746, in mpi4py.MPI.PyMPI_allgather (src/mpi4py.MPI.c:48575)
mpi4py.MPI.Exception: MPI_ERR_IN_STATUS: error code in status
Traceback (most recent call last):
File "opt_5mw.py", line 216, in <module>
top.setup() #call setup
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/problem.py", line 644, in setup
self.root._setup_vectors(param_owners, impl=self._impl, alloc_derivs=alloc_derivs)
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/group.py", line 476, in _setup_vectors
self._u_size_lists = self.unknowns._get_flattened_sizes()
File "/home/austinherrema/.local/lib/python2.7/site-packages/openmdao/core/petsc_impl.py", line 204, in _get_flattened_sizes
return self.comm.allgather(sizes)
File "MPI/Comm.pyx", line 1291, in mpi4py.MPI.Comm.allgather (src/mpi4py.MPI.c:109194)
File "MPI/msgpickle.pxi", line 749, in mpi4py.MPI.PyMPI_allgather (src/mpi4py.MPI.c:48609)
File "MPI/msgpickle.pxi", line 191, in mpi4py.MPI.Pickle.loadv (src/mpi4py.MPI.c:41957)
File "MPI/msgpickle.pxi", line 143, in mpi4py.MPI.Pickle.load (src/mpi4py.MPI.c:41248)
cPickle.BadPickleGet: 65
I am running under Ubuntu with OpenMDAO 1.7.3. I have tried running with both mpirun.openmpi (OpenRTE) 1.4.3 and mpirun (Open MPI) 1.4.3 and have gotten the same result in each case.
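For reference, the overall layout of the script looks roughly like the sketch below. The names, counts, and component contents are simplified placeholders, and the ParallelFDGroup usage reflects my understanding of the 1.7.x API rather than a copy of my actual code:

# Simplified sketch of the problem layout (not my actual script).
# Names, counts, and sizes are placeholders; OpenMDAO 1.7.x API assumed.
from openmdao.api import Problem, Component, ParallelGroup, ParallelFDGroup
from openmdao.core.petsc_impl import PetscImpl as impl  # PETSc data passing under MPI


class ExtCodeComp(Component):
    """Stand-in for one instance of the external code."""
    def __init__(self):
        super(ExtCodeComp, self).__init__()
        self.add_param('x', val=0.0)
        # a few dozen large array outputs, as described above
        for i in range(30):
            self.add_output('out{0}'.format(i), shape=[50000])

    def solve_nonlinear(self, params, unknowns, resids):
        pass  # the real component launches the external code here


top = Problem(impl=impl)
top.root = ParallelFDGroup(1)  # argument = number of parallel FD points (assumed); not FD'ing yet
# (the few "normal" components in the root are omitted here)
par = top.root.add('par_group', ParallelGroup())
for i in range(56):
    par.add('case{0}'.format(i), ExtCodeComp())

top.setup()  # this is where the allgather failure occurs
top.run()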
I found this post that seems to suggest that there is something wrong with the MPI installation. But if that were the case, it strikes me as strange that the problem works with a small number of processors but not with a larger number. I can also run a relatively simple OpenMDAO problem (no external codes) on 32 processors without incident.
Because the traceback references OpenMDAO unknowns, I wondered if there are limitations on the size of OpenMDAO unknowns. In my case, each external code component has a few dozen array outputs that can be up to 50,000-60,000 elements each. Might that be problematic? Each external code component also reads the same set of input files. Could that be an issue as well? I have tried to ensure that read and write access is defined properly but perhaps that's not enough.
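For scale, here is a rough back-of-envelope estimate of how much unknown data that implies (taking "a few dozen" as 30 outputs of 55,000 elements each--both numbers assumed):

# Rough size estimate of the unknowns (numbers assumed, not measured)
n_outputs = 30          # "a few dozen" array outputs per component
n_elements = 55000      # 50,000-60,000 elements each
n_components = 56       # instances of the external code
bytes_per_float = 8     # double precision

per_comp_mb = n_outputs * n_elements * bytes_per_float / 1.0e6
total_gb = per_comp_mb * n_components / 1.0e3
print('~%.0f MB of unknowns per component, ~%.1f GB across the parallel group'
      % (per_comp_mb, total_gb))
# prints: ~13 MB of unknowns per component, ~0.7 GB across the parallel group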
Any suggestions about what might be the culprit in this situation are appreciated.
EDIT: I should add that I have tried running the problem without actually running the external codes (i.e. the components in the parallel group are called and set up but the external subprocesses are never actually created) and the problem persists.
EDIT2: I have done some more debugging on this issue and thought I should share the little that I have discovered. If I strip the problem down to only the parallel group containing the external code instances, the problem persists. However, if I reduce the components in the parallel group to basically nothing--just a print function for setup and for solve_nonlinear--then the problem can successfully "run" with a large number of processors. I started adding setup lines back in one by one to see what would create problems. I ran into issues when trying to add many large unknowns to the components. I can actually still add just a single large unknown--for example, this works:
self.add_output('BigOutput', shape=[100000])
But when I try to add too many large outputs like below, I get errors:
for i in range(100):
    outputname = 'BigOutput{0}'.format(i)
    self.add_output(outputname, shape=[100000])
Sometimes I just get a general segmentation violation error from PETSc. Other times I get a fairly lengthy traceback that is too long to post here--I'll post just the beginning in case it provides any helpful clues:
*** glibc detected *** python2.7: free(): invalid pointer: 0x00007f21204f5010 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7da26)[0x7f2285f0ca26]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/../../libsqlite3.so.0(sqlite3_free+0x4f)[0x7f2269b7754f]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/../../libsqlite3.so.0(+0x1cbbc)[0x7f2269b87bbc]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/../../libsqlite3.so.0(+0x54d6c)[0x7f2269bbfd6c]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/../../libsqlite3.so.0(+0x9d31f)[0x7f2269c0831f]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/../../libsqlite3.so.0(sqlite3_step+0x1bf)[0x7f2269be261f]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/_sqlite3.so(pysqlite_step+0x2d)[0x7f2269e4306d]
/home/austinherrema/miniconda2/lib/python2.7/lib-dynload/_sqlite3.so(_pysqlite_query_execute+0x661)[0x7f2269e404b1]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8942)[0x7f2286c6a5a2]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x86c3)[0x7f2286c6a323]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x86c3)[0x7f2286c6a323]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f2286c6b1ce]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(+0x797e1)[0x7f2286be67e1]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x53)[0x7f2286bb6dc3]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(+0x5c54f)[0x7f2286bc954f]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x53)[0x7f2286bb6dc3]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x43)[0x7f2286c60d63]
/home/austinherrema/miniconda2/bin/../lib/libpython2.7.so.1.0(+0x136652)[0x7f2286ca3652]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f2286957e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f2285f8236d]
======= Memory map: ========
00400000-00401000 r-xp 00000000 08:03 9706352 /home/austinherrema/miniconda2/bin/python2.7
00600000-00601000 rw-p 00000000 08:03 9706352 /home/austinherrema/miniconda2/bin/python2.7
00aca000-113891000 rw-p 00000000 00:00 0 [heap]
7f21107d6000-7f2241957000 rw-p 00000000 00:00 0
etc...
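To make that stripped-down test concrete, the failing case boils down to roughly the script below (condensed; details assumed), run with something like mpirun -np 20 python repro.py or more processes:

# Condensed version of the stripped-down test described above (details assumed)
from openmdao.api import Problem, Component, Group, ParallelGroup
from openmdao.core.petsc_impl import PetscImpl as impl


class BigOutputComp(Component):
    """Does basically nothing except declare many large outputs."""
    def __init__(self):
        super(BigOutputComp, self).__init__()
        for i in range(100):
            self.add_output('BigOutput{0}'.format(i), shape=[100000])

    def solve_nonlinear(self, params, unknowns, resids):
        print(self.pathname)  # just a print, no external code


top = Problem(impl=impl)
top.root = Group()
par = top.root.add('par_group', ParallelGroup())
for i in range(56):
    par.add('case{0}'.format(i), BigOutputComp())

top.setup()  # fails with 20+ processors, works with a handful
top.run()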
Upvotes: 0
Views: 331
Reputation: 5710
It's hard to guess what's going on here, but if it works for a small number of processors and not for larger ones, one guess is that the issue only shows up once you use more than one node and data has to get transferred across the network. I have seen bad MPI compilations that behaved this way: things would work as long as I kept the job to one node, but broke on more than one.
The traceback shows that you're not even getting through setup, so it's not likely to be anything in your external code or in any other component's run method.
If you're running on a cluster, are you compiling your own MPI? You usually need to compile it with very specific options/libraries for any kind of HPC library. But most HPC systems provide modules you can load that already have MPI pre-compiled.
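One way to separate OpenMDAO from the MPI install is to run a bare mpi4py test that does the same kind of allgather, with the same processor count and node layout that fails for you. Something along these lines (just a sketch); if this also dies at 20+ processes across multiple nodes, the problem is the MPI build, not your model:

# mpi_smoke_test.py -- bare mpi4py check, no OpenMDAO involved.
# Run with the same command and processor count that fails, e.g.
#   mpirun -np 32 python mpi_smoke_test.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# pickle-based allgather of metadata, similar in spirit to _get_flattened_sizes
sizes = {'rank': rank, 'sizes': [50000] * 30}
gathered = comm.allgather(sizes)

# also push some real array data across the network
local = np.full(50000, float(rank))
total = np.zeros_like(local)
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    expected = sum(range(comm.size))
    print('allgather returned %d entries; Allreduce ok: %s'
          % (len(gathered), np.allclose(total, expected)))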
Upvotes: 0