Reputation: 4241
I have a function (neural network model) which produces figures. I wish to test several parameters, methods and different inputs (meaning hundreds of runs of the function) from python using PBS on a standard cluster with Torque.
Note: I tried parallelpython, ipython and such and was never completely satisfied, since I want something simpler. The cluster is in a given configuration that I cannot change and such a solution integrating python + qsub will certainly benefit to the community.
To simplify things, I have a simple function such as:
import myModule
def model(input, a= 1., N=100):
do_lots_number_crunching(input, a,N)
pylab.savefig('figure_' + input.name + '_' + str(a) + '_' + str(N) + '.png')
where input
is an object representing the input, input.name
is a string, anddo_lots_number_crunching
may last hours.
My question is: is there a correct way to transform something like a scan of parameters such as
for a in pylab.linspace(0., 1., 100):
model(input, a)
into "something" that would launch a PBS script for every call to the model
function?
#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py
I was thinking of a function that would include the PBS template and call it from the python script, but could not yet figure it out (decorator?).
Upvotes: 10
Views: 5841
Reputation: 347
pbs_python[1] could work for this. If experiment_model.py 'a' as an argument you could do
import pbs, os
server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)
attopl = pbs.new_attropl(4)
attropl[0].name = pbs.ATTR_l
attropl[0].resource = 'ncpus'
attropl[0].value = '1'
attropl[1].name = pbs.ATTR_l
attropl[1].resource = 'mem'
attropl[1].value = 'i1000mb'
attropl[2].name = pbs.ATTR_l
attropl[2].resource = 'cput'
attropl[2].value = '24:00:00'
attrop1[3].name = pbs.ATTR_V
script='''
cd /data/work/
python experiment_model.py %f
'''
jobs = []
for a in pylab.linspace(0.,1.,100):
script_name = 'experiment_model.job' + str(a)
with open(script_name,'w') as scriptf:
scriptf.write(script % a)
job_id = pbs.pbs_submit(c, attropl, script_name, 'NULL', 'NULL')
jobs.append(job_id)
os.remove(script_name)
print jobs
[1]: https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueUsage pbs_python
Upvotes: 4
Reputation: 932
I just started working with clusters and EP applications. My goal (I'm with the Library) is to learn enough to help other researchers on campus access HPC with EP applications...especially researchers outside of STEM. I'm still very new, but thought it may help this question to point out the use of GNU Parallel in a PBS script to launch basic python scripts with varying arguments. In the .pbs file, there are two lines to point out:
module load gnu-parallel # this is required on my environment
parallel -j 4 --env PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
--workdir $NODE_LOCAL_DIR --transfer --return 'output.{#}' --clean \
`pwd`/simple.py '{#}' '{}' ::: $INPUT_DIR/input.*
# `-j 4` is the number of processors to use per node, will be cluster-specific
# {#} will substitute the process number into the string
# `pwd`/simple.py `{#}` `{}` this is the command that will be run multiple times
# ::: $INPUT_DIR/input.* all of the files in $INPUT_DIR/ that start with 'input.'
# will be substituted into the python call as the second(3rd) argument where the
# `{}` resides. These can be simple text files that you use in your 'simple.py'
# script to pass the parameter sets, filenames, etc.
As a newby to EP supercomputing, even though I don't yet understand all the other options on "parallel", this command allowed me to launch python scripts in parallel with different parameters. This would work well if you can generate a slew of parameter files ahead of time that will parallelize your problem. For example, running simulations across a parameter space. Or processing many files with the same code.
Upvotes: 0
Reputation: 451
It looks like I'm a little late to the party, but I also had the same question of how to map embarrassingly parallel problems onto a cluster in python a few years ago and wrote my own solution. I recently uploaded it to github here: https://github.com/plediii/pbs_util
To write your program with pbs_util, I would first create a pbs_util.ini in the working directory containing
[PBSUTIL]
numnodes=1
numprocs=1
mem=i1000mb
walltime=24:00:00
Then a python script like this
import pbs_util.pbs_map as ppm
import pylab
import myModule
class ModelWorker(ppm.Worker):
def __init__(self, input, N):
self.input = input
self.N = N
def __call__(self, a):
myModule.do_lots_number_crunching(self.input, a, self.N)
pylab.savefig('figure_' + self.input.name + '_' + str(a) + '_' + str(self.N) + '.png')
# You need "main" protection like this since pbs_map will import this file on the compute nodes
if __name__ == "__main__":
input, N = something, picklable
# Use list to force the iterator
list(ppm.pbs_map(ModelWorker, pylab.linspace(0., 1., 100),
startup_args=(input, N),
num_clients=100))
And that would do it.
Upvotes: 2
Reputation: 7014
You can do this easily using jug (which I developed for a similar setup).
You'd write in file (e.g., model.py
):
@TaskGenerator
def model(param1, param2):
res = complex_computation(param1, param2)
pyplot.coolgraph(res)
for param1 in np.linspace(0, 1.,100):
for param2 in xrange(2000):
model(param1, param2)
And that's it!
Now you can launch "jug jobs" on your queue: jug execute model.py
and this will parallelise automatically. What happens is that each job will in, a loop, do something like:
while not all_done():
for t in tasks in tasks_that_i_can_run():
if t.lock_for_me(): t.run()
(It's actually more complicated than that, but you get the point).
It uses the filesystem for locking (if you're on an NFS system) or a redis server if you prefer. It can also handle dependencies between tasks.
This is not exactly what you asked for, but I believe it's a cleaner architechture to separate this from the job queueing system.
Upvotes: 3