Reputation: 21
I have an MPI program (C code for a school project) that I want to run on multiple nodes (2 for now), but it doesn't work: it waits forever without printing any output or error.
I start it on both machines with the command mpirun -np 2 --host 192.168.0.1,192.168.0.2 ./mandelbrot_mpi_omp
(the IP addresses are placeholders; the real ones are different and correct). I give the addresses in the same order on both machines, so the first one is always the master with rank 0.
Here is the main function of the program, just in case. I don't think it is the reason MPI fails across nodes, but I might be wrong:
int main(int argc, char* argv[]){
    int width = SCALE_X;
    int height = SCALE_Y;
    // MPI init & setup
    MPI_Init(&argc, &argv);
    int world_size;
    int rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // calculate size of buffer according to server count
    int part_height = SCALE_Y/world_size;
    int buffer_size = (width+1)*(part_height+1)*3;
    // dynamically allocate arrays for image data according to server count
    send_buffer = calloc( buffer_size, sizeof(PIXEL));
    recv_buffer = calloc( buffer_size*world_size, sizeof(PIXEL));
    if(rank == 0) printf("MPI node count: %i\n", world_size);
    MPI_Barrier(MPI_COMM_WORLD);
    // OpenMP setup
    int cpu_count = omp_get_num_procs();
    omp_set_num_threads(cpu_count);
    printf("OpenMP cpu count on node %i: %i\n", rank, cpu_count);
    // omp_get_num_threads() returns 1 outside a parallel region; omp_get_max_threads() reports the limit
    printf("OpenMP (max) thread count on node %i: %i\n", rank, omp_get_max_threads());
    MPI_Barrier(MPI_COMM_WORLD);
    // generate a part of mandelbrot set according to world size and rank of this server
    mandelbrot(rank, world_size, width, part_height);
    // gather parts of mandelbrot from all nodes
    MPI_Gather(send_buffer, (width)*(part_height)*3, MPI_CHAR, recv_buffer, (width)*(part_height)*3, MPI_CHAR, 0, MPI_COMM_WORLD);
    // save raster array of mandelbrot data to png file
    if(rank == 0) save_to_png(width, height);
    printf("Process %i finished.\n", rank);
    MPI_Finalize();
    return 0;
}
I am running OpenMPI from the Debian repositories on Debian 11 (on both machines).
I tried replacing the -np parameter with -n, with no effect.
If I run two processes on the same machine with mpirun -np 2 --host 127.0.0.1,127.0.0.1 ./mandelbrot_mpi_omp
then it works flawlessly: it launches two processes, and both do their job fine.
If I stop the task on both computers with CTRL+Z (while it is waiting indefinitely and not actually running), I get this error:
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: <hostname placeholder>
target node: <real ip here>
But the machines can communicate: I can ping them and connect from each to the other with ssh. They have the same username and password.
What am I missing? Thanks in advance.
Upvotes: 1
Views: 1146
Reputation: 21
So the problem was that I couldn't log in via ssh without a password. Once I set up passwordless login to the other PCs by generating a pair of RSA keys on both machines, it worked.
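For anyone hitting the same hang, the setup can be sketched roughly as follows. This is a minimal outline, not the exact commands used here: "user" is a hypothetical account name, the IP addresses are the same placeholders as in the question, and it assumes the default key path ~/.ssh/id_rsa is free and that an empty passphrase is acceptable on your network.

```shell
# Run on the machine that launches mpirun; repeat on the other machine
# (targeting the first one) if it should also be able to start jobs.
# "user" and the IP addresses are placeholders, not real values.

# Generate an RSA key pair with an empty passphrase (assumed acceptable here).
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""

# Install the public key on the other node; this asks for the password one last time.
ssh-copy-id user@192.168.0.2

# Verify: BatchMode forbids password prompts, so this fails if key login is not working.
ssh -o BatchMode=yes user@192.168.0.2 hostname

# Quick launcher check before running the real program: each node should print its hostname.
mpirun -np 2 --host 192.168.0.1,192.168.0.2 hostname
```

If the BatchMode check still asks for or rejects a password, mpirun will most likely keep hanging as well, since it starts its remote daemon over ssh by default.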
Upvotes: 1