Reputation: 885
I am running into a few issues with SSH and with running MPICH. From some previous questions that I asked, I was able to progress to the point of executing the mpi_hello.c program.
For reference, I am working on following this tutorial on setting up MPICH: https://help.ubuntu.com/community/MpichCluster
I created a directory in the root of the filesystem called /clusterFiles, and I created a user called clusterUser (clusteruser) on all of the nodes. I exported /clusterFiles from the master node and mounted it on all of the other nodes. On the master node I changed the ownership of /clusterFiles to clusteruser, and I also changed the home directory of clusteruser to be /clusterFiles.
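For reference, the setup steps looked roughly like the following. This is a sketch from memory rather than my exact history; "master" stands in for my head node's hostname, and the export options are only my approximation of what the tutorial suggests:

# on the master node ("master" below is a placeholder for its hostname)
sudo mkdir /clusterFiles
sudo adduser clusteruser
sudo usermod -d /clusterFiles clusteruser      # home directory moved onto the share
echo "/clusterFiles *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo service nfs-kernel-server restart
sudo chown -R clusteruser /clusterFiles

# on each of the other nodes
sudo adduser clusteruser
sudo usermod -d /clusterFiles clusteruser
sudo mkdir -p /clusterFiles
sudo mount master:/clusterFiles /clusterFiles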
I created an SSH key for clusteruser on the master node and added the key to authorized_keys. I installed keychain on all of the nodes, and on the master node I edited .bashrc as specified in the guide (I copied what was in the guide into .bashrc).
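The key and keychain setup was approximately this (the .bashrc line is only my approximation of what the guide has; since the home directory is the shared /clusterFiles, the authorized_keys file is visible from every node):

# as clusteruser on the master node
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# keychain was installed on every node
sudo apt-get install keychain

# and ~/.bashrc got lines along the lines of:
# eval `keychain --eval --agents ssh id_rsa`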
I also installed MPICH2 and GCC on all nodes.
I edited the machinefile for my specific cluster.
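The machinefile itself is just a list of hostnames, one per line, optionally with a process count. Mine looks something like this (rgcluster2blade1 is real; the other hostnames and the counts are placeholders for illustration):

rgcluster2blade1:2
rgcluster2blade2:2
rgcluster2blade3:2
rgcluster2blade4:1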
However, the errors occur when I go to execute the mpi_hello.c program.
I copied and pasted the code from the guide into a .c file and called it mpi_hello.c (this was done on the master node).
In the last part of the guide, he just calls mpicc [arguments] and mpiexec [arguments]. However, when I go to call mpicc, I need to run sudo mpicc [arguments]. Is this a problem I should be concerned with, or is this the proper way that it should be done?
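My suspicion is that this is really an ownership problem on /clusterFiles rather than an mpicc problem, i.e. clusteruser cannot write the output binary into the directory. A quick way to check (a sketch; the paths assume my layout) would be:

# check who owns the working directory and the source file
ls -ld /clusterFiles
ls -l /clusterFiles/mpi_hello.c

# if clusteruser owns them, this should work without sudo
mpicc mpi_hello.c -o mpi_hello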
When I run mpiexec (without sudo), I receive the following errors:
clusteruser@rgcluster2blade1:~$ mpiexec -n 7 -f machinefile ./mpi_hello
[mpiexec@rgcluster2blade1] HYDU_parse_hostfile (./utils/args/args.c:323): unable to open host file: machinefile
[mpiexec@rgcluster2blade1] mfile_fn (./ui/mpich/utils.c:341): error parsing hostfile
[mpiexec@rgcluster2blade1] match_arg (./utils/args/args.c:153): match handler returned error
[mpiexec@rgcluster2blade1] HYDU_parse_array (./utils/args/args.c:175): argument matching returned error
[mpiexec@rgcluster2blade1] parse_args (./ui/mpich/utils.c:1609): error parsing input array
[mpiexec@rgcluster2blade1] HYD_uii_mpx_get_parameters (./ui/mpich/utils.c:1660): unable to parse user arguments
[mpiexec@rgcluster2blade1] main (./ui/mpich/mpiexec.c:153): error parsing parameters
Are these files something that I forgot to install? At first, I thought that I needed sudo in front of mpiexec. So when I run sudo mpiexec [arguments], it "runs" but connects to the cluster over SSH as root, when I need it to connect as clusteruser.
My main concern is that he is not executing his commands as root. I am wondering if there is a step that is implied, or at least a command that I was supposed to execute but didn't?
Also, I noticed that when I tried changing ownership of clusterFiles to clusterUser on the other nodes, I would get an "Operation not permitted" error (I was root when I ran this command). My thinking is that since I changed the ownership on the master node, it propagated to the other nodes since they have the same username, so I was effectively changing the ownership to itself. Is this correct thinking, or is there more to it than that?
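One thing I have since read (so treat this as my assumption, not something from the guide) is that NFS by default "squashes" root on the client side, mapping it to an unprivileged user, which would explain why chown as root fails on the nodes that only mount the share. The relevant export option on the master node would look something like:

# /etc/exports on the master -- root_squash is the default and maps a client's
# root to nobody, so chown over the mount fails; no_root_squash would allow it,
# though chown'ing once on the master (the NFS server) should be enough
/clusterFiles *(rw,sync,no_root_squash,no_subtree_check)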
Edit:
From the suggestion of user Zulan, I have checked the permissions of the machinefile. Interestingly enough, it is still set to rgcluster2blade1. I decided to run the command sudo chown -R clusteruser /clusterFiles in order to make all files/folders within /clusterFiles owned by clusteruser. I have done this on the master node only, and will be checking the other nodes.
Edit 2:
OK, so after checking the rest of the cluster (I am only experimenting with 4 nodes right now before doing the whole thing), I found that 2 of the nodes were giving ownership to another user besides clusteruser: they were giving it to the user render. I attempted to perform the sudo chown command, but on both I received an "Operation not permitted" error.
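To see what is actually going on, I have started comparing the numeric IDs, since NFS matches owners by UID/GID rather than by name (the values in the comment are just an example):

# on each node
id clusteruser          # e.g. uid=1001(clusteruser) gid=1001(clusteruser) ...
ls -ln /clusterFiles    # numeric owner of the shared files as that node sees them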
Upvotes: 1
Views: 1937
Reputation: 885
Just as an update: since I discovered that the GID and UID were all messed up, I decided to delete the user and create a new account. Before doing anything else, I made sure to check and, if needed, change the UID and GID of the users so that they are the same on all nodes. I cannot remember the command off the top of my head. I will look for it later, and once I find it, I will update this answer.
Afterwards, I proceeded with the guide and everything worked fine.
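(For anyone following along: I still have not dug up the exact command I used, but the usual way to line up IDs on Ubuntu is usermod/groupmod; the number below is just a placeholder, not necessarily what I chose.)

# pick one UID/GID and apply it on every node (1001 is only an example)
sudo groupmod -g 1001 clusteruser
sudo usermod -u 1001 -g 1001 clusteruser
# then fix files that still carry the old numeric owner
sudo chown -R clusteruser:clusteruser /clusterFiles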
Upvotes: 1