Reputation: 51
This is my code:
#include "mpi.h"
#include <stdio.h>

int main(int argc, char** argv) {
    int numtasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // the code fails with or without the printf
    printf("Number of tasks= %d My rank= %d\n", numtasks, rank);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
This is how I run it, followed by the output:
$ mpirun -n 160 ./mpi_example1
[proxy:0:0@ubuntu] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
[proxy:0:0@ubuntu] fn_get (./pm/pmiserv/pmip_pmi_v1.c:349): error sending PMI response
[proxy:0:0@ubuntu] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
[proxy:0:0@ubuntu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@ubuntu] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@ubuntu] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
[mpiexec@ubuntu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@ubuntu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec@ubuntu] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
When I run the code with -n 128 or lower, it works fine. I also tried running the code on a machine with 8 nodes of 32 cores each and was able to go up to -n 192; when I try -n 224, it fails.
Any suggestions? Thanks.
Upvotes: 3
Views: 5236
Reputation: 16090
This is not a definitive answer, but it's way too long for a comment.
I took a look at the source of the failed assert. The codebase is slightly different, but I think it's close enough. Your error reports the failed assertion at line 80, whereas in the version I looked at, the assertion HYDU_ASSERT(!closed, status); is at line 82.
The offending call is located at line 77:
status = HYDU_sock_write(fd, cmd, strlen(cmd), &sent, &closed, HYDU_SOCK_COMM_MSGWAIT);
Now, the code for HYDU_sock_write says that the closed flag will be set and the function will abort the operation when
write(fd, (char *) buf + *sent, maxlen - *sent);
at line 278 fails with errno == ECONNRESET.
The documentation for write says: "[ECONNRESET] A write was attempted on a socket that is not connected."
Are you sure the network is working fine? It seems like sockets get disconnected.
Upvotes: 0
Reputation: 25429
The problem may be related to the maximum number of processes that your shell is allowed to spawn. How to change this setting depends on the shell and on the operating system. If you are using csh or tcsh, you can check your current setting from the command line with the "limit" command. The setting can be changed either at the user level or at the root level (there are both soft and hard limits).
Upvotes: 1