Python (or maybe Linux in general) file operation flow control or file lock

Question

I am using a cluster of computers to do some parallel computation. My home directory is shared across the cluster. On one machine, I have Ruby code that creates bash scripts containing the computation commands and writes them to, say, the ~/q/ directory. The scripts are named *.worker1.sh, *.worker2.sh, etc.

On the other 20 machines, I have 20 Python programs running (one on each machine) that constantly check the ~/q/ directory and look for jobs that belong to that machine, using Python code like this:

import glob, os
jobs = glob.glob('q/*.worker1.sh')
for job in jobs:
    os.system('sh ' + job + ' &')  # trailing & runs the script in the background

For some additional control, the Ruby code creates an empty file named workeri.start (i = 1..20) in the q directory after it writes the bash script there, and the Python code checks for that 'start' file before it runs the code above. In the bash script, if the command finishes successfully, the script creates an empty file named 'workeri.success'. The Python code checks for this file after it runs the code above to make sure the computation finished successfully. If it did, Python removes the 'start' file from the q directory, so the Ruby code knows that the job finished successfully. After all 20 bash scripts have finished, the Ruby code creates new bash scripts, Python reads and executes them, and so on.
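The worker side of this handshake can be sketched roughly as follows. This is only an illustration under assumptions: the function name poll_once and its parameters are hypothetical, the scripts are launched with subprocess.Popen and waited on (rather than with os.system and a trailing &) so that the success check happens after they exit, and the 'success' file is left in place because the description does not say who removes it.

```python
import glob
import os
import subprocess

def poll_once(qdir, worker):
    """Run this worker's scripts once if its start file exists.

    Returns True if the scripts ran and the success file was found.
    (Hypothetical helper; names are illustrative only.)
    """
    start = os.path.join(qdir, worker + '.start')
    success = os.path.join(qdir, worker + '.success')
    if not os.path.exists(start):
        return False  # nothing has been published for this worker yet

    scripts = sorted(glob.glob(os.path.join(qdir, '*.' + worker + '.sh')))
    # Launch every script first, then wait, so they run concurrently.
    procs = [subprocess.Popen(['sh', script]) for script in scripts]
    for proc in procs:
        proc.wait()

    if os.path.exists(success):
        os.remove(start)  # signals the Ruby side that this batch finished
        return True
    return False
```

A worker would call poll_once('q', 'worker1') inside its polling loop, ideally with a short sleep between passes.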

I know this is not an elegant way to coordinate the computation, but I haven't figured out a better way to communicate between different machines.

Now the question: I expect the 20 jobs to run more or less in parallel, so that the total time to finish all 20 is not much longer than the time to finish one. However, the jobs seem to run sequentially, and the total time is much longer than I expected.

I suspect that part of the reason is that multiple programs are reading and writing the same directory at once, but Linux or Python locks the directory and allows only one process to operate on it at a time. This would make the jobs execute one at a time.

I am not sure whether this is the case. If I split the bash scripts into different directories and let the Python code on each machine read and write its own directory, will that solve the problem? Or are there other reasons that could cause it?

Thanks a lot for any suggestions! Let me know if I didn't explain anything clearly.

Some additional info: my home directory is at /home/my_group/my_home. Here is the mount info for it:

:/vol/my_group on /home/my_group type nfs (rw,nosuid,nodev,noatime,tcp,timeo=600,retrans=2,rsize=65536,wsize=65536,addr=...)

By "constantly check the q directory" I mean a Python loop like this:

import os

while True:
    if os.path.exists('q/worker1.start'):  # the 'start' file for this machine
        pass  # find the scripts and execute them as I mentioned above

grncdr · Accepted Answer

"I know this is not an elegant way to coordinate the computation, but I haven't figured out a better way to communicate between different machines."

While this isn't directly what you asked, you should really, really consider fixing your problem at this level. Using some sort of shared message queue is likely to be a lot simpler to manage and debug than relying on the locking semantics of a particular networked filesystem.

The simplest solution to set up and run, in my experience, is Redis on the machine currently running the Ruby script that creates the jobs. It should literally be as simple as downloading the source, compiling it, and starting it up. Once the Redis server is up and running, you change your code to append the computation commands to one or more Redis lists. In Ruby you would use the redis-rb library like this:

require "redis"

redis = Redis.new
# Your other code to build up command lists...
redis.lpush 'commands', [command1, command2]  # pass several values as one array

If the computations need to be handled by certain machines, use a list per-machine like this:

redis.lpush 'jobs:machine1', command1
# etc.

Then in your Python code, you can use redis-py to connect to the Redis server and pull jobs off the list like so:

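import os
from redis import Redis

# decode_responses=True makes lpop return str rather than bytes on Python 3
r = Redis(host="hostname-of-machine-running-redis", decode_responses=True)
while r.llen('jobs:machine1'):
    job = r.lpop('jobs:machine1')  # same key the Ruby code pushes to
    os.system('sh ' + job + ' &')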
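The loop above stops as soon as the list is empty. For a worker that should keep waiting for new jobs, one option is redis-py's blocking pop, blpop, which waits up to a timeout and returns a (key, value) pair, or None if nothing arrived. The sketch below is only an illustration: run_worker and its launch parameter are hypothetical names, and each job is assumed to be a script path, as in the examples above.

```python
import subprocess

def run_worker(client, queue, timeout=5, launch=None):
    """Pop jobs from `queue` until a pop times out; return the jobs launched.

    `client` is anything with a redis-py style blpop(key, timeout=...) method.
    `launch` is a hypothetical hook so the launcher can be swapped out.
    """
    if launch is None:
        # Default: start the script without waiting for it to finish.
        launch = lambda job: subprocess.Popen(['sh', job])
    launched = []
    while True:
        item = client.blpop(queue, timeout=timeout)
        if item is None:
            break  # nothing arrived within `timeout` seconds
        _key, job = item
        if isinstance(job, bytes):
            job = job.decode()  # the client was created without decode_responses
        launch(job)
        launched.append(job)
    return launched
```

With a real server this would be called as run_worker(Redis(host="hostname-of-machine-running-redis"), 'jobs:machine1'). Wrapping the call in an outer loop keeps the worker alive across idle periods.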

Of course, you could just as easily pull jobs off the queue and execute them in Ruby:

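require 'redis'
redis = Redis.new(:host => 'hostname-of-machine-running-redis')
# Compare with 0 explicitly: in Ruby, 0 is truthy, so a bare llen would loop forever
while redis.llen('jobs:machine1') > 0
    job = redis.lpop('jobs:machine1')  # same key the producer pushes to
    # system returns at once; backticks would block until the job closes its output
    system("sh #{job} &")
end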

With some more details about the needs of the computation and the environment it's running in, it would be possible to recommend even simpler approaches to managing it.
