Reputation: 58
I'm trying to get a parallel workflow to run in which I evaluate over 1000 parallel cases inside a ParallelGroup. If I run on a small number of cores it doesn't crash, but increasing the number of nodes at some point raises an error, which suggests that it is related to how the problem is partitioned.
I'm getting an error from the deep dungeons of OpenMDAO and PETSc, relating to the target indices when setting up the communication tables, as far as I can tell. Below is the traceback of the error:
File "/home/frza/git/OpenMDAO/openmdao/core/group.py", line 454, in _setup_vectors
impl=self._impl, alloc_derivs=alloc_derivs)
File "/home/frza/git/OpenMDAO/openmdao/core/group.py", line 1456, in _setup_data_transfer
self._setup_data_transfer(my_params, None, alloc_derivs)
File "/home/frza/git/OpenMDAO/openmdao/core/petsc_impl.py", line 125, in create_data_xfer
File "/home/frza/git/OpenMDAO/openmdao/core/petsc_impl.py", line 397, in __init__
tgt_idx_set = PETSc.IS().createGeneral(tgt_idxs, comm=comm)
File "PETSc/IS.pyx", line 74, in petsc4py.PETSc.IS.createGeneral (src/petsc4py.PETSc.c:74696)
tgt_idx_set = PETSc.IS().createGeneral(tgt_idxs, comm=comm)
File "PETSc/arraynpy.pxi", line 121, in petsc4py.PETSc.iarray (src/petsc4py.PETSc.c:8230)
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
This answer led me to look for where the tgt_idxs vector is set up, to see whether it's defined with the correct dtype PETSc.IntType. But so far I only get "Petsc has generated inconsistent data" errors when I try to set the dtype of the arrays I think may be causing the error.
I've not yet tried to reinstall PETSc with --with-64-bit-indices, as suggested in the answer I linked to. Do you run PETSc configured this way?
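In case it helps pin it down, here is a minimal sketch of the cast that petsc4py rejects, outside OpenMDAO (the index values are made up; only the dtypes matter):

import numpy as np
from petsc4py import PETSc

# PETSc.IntType is int32 on a default PETSc build and int64 when PETSc
# is configured --with-64-bit-indices.
idxs = np.arange(10)                      # numpy gives int64 on most 64-bit Linux platforms
# PETSc.IS().createGeneral(idxs)          # on a 32-bit-index build this raises the same
#                                         # "Cannot cast ... according to the rule 'safe'" TypeError
iset = PETSc.IS().createGeneral(idxs.astype(PETSc.IntType))  # explicit cast works while values fit in 32 bits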
edit: I've now set up a stripped down version of the problem that replicates the error I get:
import numpy as np
from openmdao.api import Component, Group, Problem, IndepVarComp, \
    ParallelGroup


class Model(Component):

    def __init__(self, nsec, nx, nch):
        super(Model, self).__init__()

        self.add_output('outputs', shape=[nx+1, nch*6*3*nsec])

    def solve_nonlinear(self, params, unknowns, resids):
        pass


class Aggregate(Component):

    def __init__(self, nsec, ncase, nx, nch, nsec_env=12):
        super(Aggregate, self).__init__()

        self.ncase = ncase
        for i in range(ncase):
            self.add_param('outputs_sec%03d' % i, shape=[nx+1, nch*6*3*nsec])

        for i in range(nsec):
            self.add_output('aoutput_sec%03d' % i, shape=[nsec_env, 6])

    def solve_nonlinear(self, params, unknowns, resids):
        pass


class ParModel(Group):

    def __init__(self, nsec, ncase, nx, nch, nsec_env=12):
        super(ParModel, self).__init__()

        # all cases run concurrently under a ParallelGroup
        pg = self.add('pg', ParallelGroup())

        promotes = ['aoutput_sec%03d' % i for i in range(nsec)]
        self.add('agg', Aggregate(nsec, ncase, nx, nch, nsec_env), promotes=promotes)

        # connect every case's output array to the aggregating component
        for i in range(ncase):
            pg.add('case%03d' % i, Model(nsec, nx, nch))
            self.connect('pg.case%03d.outputs' % i, 'agg.outputs_sec%03d' % i)


if __name__ == '__main__':

    from openmdao.core.mpi_wrap import MPI

    if MPI:
        from openmdao.core.petsc_impl import PetscImpl as impl
    else:
        from openmdao.core.basic_impl import BasicImpl as impl

    p = Problem(impl=impl, root=Group())
    root = p.root

    root.add('dlb', ParModel(20, 1084, 36, 6))

    import time
    t0 = time.time()
    p.setup()
    print 'setup time', time.time() - t0
Having done that, I can also see that the data size ends up enormous because of the many cases we evaluate. I'll see if we can somehow reduce the data sizes. I can't actually get this to run at all now, since it crashes either with this error:
petsc4py.PETSc.Error: error code 75
[77] VecCreateMPIWithArray() line 320 in /home/MET/Python-2.7.10_Intel/opt/petsc-3.6.2/src/vec/vec/impls/mpi/pbvec.c
[77] VecSetSizes() line 1374 in /home/MET/Python-2.7.10_Intel/opt/petsc-3.6.2/src/vec/vec/interface/vector.c
[77] Arguments are incompatible
[77] Local size 86633280 cannot be larger than global size 73393408
or with the TypeError shown above.
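For what it's worth, the local size in that message matches a back-of-the-envelope count of the values the aggregating component has to receive in the script above (my own arithmetic, not part of the error output):

# sizes for ParModel(20, 1084, 36, 6) as used in the script
nsec, ncase, nx, nch = 20, 1084, 36, 6

per_case = (nx + 1) * nch * 6 * 3 * nsec   # 37 * 2160 = 79,920 values in each 'outputs' array
total = per_case * ncase                   # 86,633,280 values -- the "Local size" in the PETSc error
print(per_case, total, total * 8 / 1024.**3)  # roughly 0.65 GiB of float64 for the params of 'agg' alone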
Upvotes: 0
Views: 119
Reputation: 754
The data sizes that you're running with are definitely larger than can be expressed by 32-bit indices, so recompiling with --with-64-bit-indices makes sense if you're not able to decrease your data size. OpenMDAO uses PETSc.IntType internally for our indices, so they should become 64-bit if you recompile.
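If you do rebuild, a quick sanity check (just a sketch; it assumes petsc4py was reinstalled against the rebuilt PETSc) is to look at the index dtype petsc4py reports:

import numpy as np
from petsc4py import PETSc

# int32 for a stock PETSc build; it should report int64 after rebuilding PETSc
# with --with-64-bit-indices and reinstalling petsc4py against it.
print(np.dtype(PETSc.IntType))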
Upvotes: 1
Reputation: 5710
I've never used that option on PETSc. A while back we did have some problems scaling up to larger numbers of cores, but we determined that the problem for us was with how OpenMPI was compiled; re-compiling OpenMPI fixed our issues.
Since this error shows up during setup, we don't need to run the model to test it. If you can provide us with the model that is showing the problem, and we can run it, then we can at least verify whether the same problem happens on our clusters.
It would also be good to know how many cores you can successfully run on, and at what point it breaks down.
Upvotes: 0