Phuocdh90

Reputation: 51

Simple MPI program fails with a large number of processes

This is my code:

#include "mpi.h"
#include <stdio.h>

int main (int argc, char** argv) {

   int  numtasks, rank; 

   MPI_Init(&argc,&argv);

   MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
   // the code fails with or without the printf
   printf ("Number of tasks= %d My rank= %d\n", numtasks,rank);

   MPI_Barrier(MPI_COMM_WORLD);
   MPI_Finalize();
   return 0;
}

And this is how I run it, along with the output:

$ mpirun -n 160 ./mpi_example1
[proxy:0:0@ubuntu] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
[proxy:0:0@ubuntu] fn_get (./pm/pmiserv/pmip_pmi_v1.c:349): error sending PMI response
[proxy:0:0@ubuntu] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
[proxy:0:0@ubuntu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@ubuntu] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@ubuntu] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
[mpiexec@ubuntu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@ubuntu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec@ubuntu] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion

When I run the code with -n 128 or lower, it works fine. I also tried running it on a 32 cores x 8 nodes machine, where it works up to -n 192 but fails with -n 224.

Any suggestions? Thanks.

Upvotes: 3

Views: 5236

Answers (2)

luk32

Reputation: 16090

This is not a definitive answer, but it's way too long for a comment.

I took a look at the source of the failed assert. The codebase is slightly different from yours, but I think it's close enough. Your error reports the failed assertion at line 80, whereas in the version I looked at, the assertion HYDU_ASSERT(!closed, status); is at line 82.

The offending call is located at line 77:

status = HYDU_sock_write(fd, cmd, strlen(cmd), &sent, &closed, HYDU_SOCK_COMM_MSGWAIT);

The code for HYDU_sock_write shows that the closed flag is set, and the function aborts the operation, when the call

write(fd, (char *) buf + *sent, maxlen - *sent);

at line 278 fails with errno == ECONNRESET.

The documentation for write says: "[ECONNRESET] A write was attempted on a socket that is not connected."

Are you sure the network is working fine? It seems like sockets get disconnected.

Upvotes: 0

Massimo Cafaro

Reputation: 25429

The problem may be related to the maximum number of processes that can be spawned from your shell. How to modify this setting depends on the shell and on the operating system. If you are using csh or tcsh, you can check your current setting from the command line with the "limit" command. The setting can be changed at the user level or at the root level (there are both soft and hard limits).

Upvotes: 1
