Vickyyy

Reputation: 197

Using Slurm to run a Python multiprocessing job

I want to run a simple task using multiprocessing (I think this is roughly equivalent to using parfor in MATLAB, correct?)

For example:

import matplotlib.pyplot as plt
from multiprocessing import Pool

def func_sq(i):
    plt.figure()             # x is a ready-to-use large ndarray; I just want
    plt.plot(x[i, :])        # to plot each column on a separate figure
    plt.savefig(....)

pool = Pool()
pool.map(func_sq, [1, 2, 3, 4, 5, 6, 7, 8])

But I am very confused about how to use Slurm to submit my job. I have been searching for answers but could not find a good one. Currently, without multiprocessing, I am using a Slurm job submission script like this (named test1.sh):

#!/bin/bash

#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p batch
#SBATCH --exclusive

module load anaconda3
source activate py36
srun python test1.py

Then I submit it by typing sbatch test1.sh at the prompt.

So if I would like to use multiprocessing, how should I modify my .sh file? I have tried on my own, but it seems that just changing -n to 16 and using Pool(16) makes my job repeat 16 times.

Or, if multiprocessing is not suitable, is there another way to maximize my performance? (I have heard about multithreading but don't know exactly how it works.)

And how do I use my memory effectively so that the job won't crash? (My x matrix is very large.)

Thanks!

For the GPU, is it possible to do the same thing? My current submission script, without multiprocessing, is:

#!/bin/bash

#SBATCH -n 1
#SBATCH -p gpu
#SBATCH --gres=gpu:1

Thanks a lot!

Upvotes: 5

Views: 5711

Answers (2)

Colin

Reputation: 10820

The "-n" flag sets the number of tasks your sbatch submission is going to execute, which is why your script is running multiple times. What you want to change is the "-c" argument, which is how many CPUs each task is assigned. Your script spawns additional processes, but they will be children of the parent process and share the CPUs assigned to it. Just add "#SBATCH -c 16" to your script.

As for memory, there is a default amount of memory your job will be given per CPU, so increasing the number of CPUs will also increase the amount of memory available. If you're not getting enough, add something like "#SBATCH --mem=20000M" to request a specific amount.
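Building on that, the Python side can size its Pool from the allocation Slurm actually grants rather than hard-coding 16. A minimal sketch (func_sq here is a trivial stand-in for the real plotting work; SLURM_CPUS_PER_TASK is the environment variable Slurm sets to match -c):

```python
import os
from multiprocessing import Pool

def func_sq(i):
    return i * i  # stand-in for the real per-item work

# Size the pool from what Slurm actually allocated via -c, falling
# back to all visible cores when run outside of Slurm.
n_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count()))

if __name__ == "__main__":
    with Pool(n_cpus) as pool:
        print(pool.map(func_sq, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

This way the same script behaves sensibly whether you submit it with -c 4 or -c 16, with no code changes.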

Upvotes: 7

scnerd
scnerd

Reputation: 6113

I don't mean to be unwelcoming here, but this question seems to indicate that you don't actually understand the tools you're using here. Python Multiprocessing allows a single Python program to launch child processes to help it perform work in parallel. This is particularly helpful because multithreading (which is commonly how you'd accomplish this in other programming languages) doesn't gain you parallel code execution in Python, due to Python's Global Interpreter Lock.

Slurm (which I don't use, but judging from some quick research) seems to be a fairly high-level utility that allows individuals to submit work to some sort of cluster of computers (or a supercomputer... usually similar concepts). It has no visibility, per se, into how the program it launches runs; that is, it has no relationship to the fact that your Python program proceeds to launch 16 (or however many) helper processes. Its job is to schedule your Python program to run as a black box, then sit back and make sure it finishes successfully.

You seem to have some vague data processing problem. You describe it as a large matrix, but you don't give nearly enough information for me to actually understand what you're trying to accomplish. Regardless, if you don't actually understand what you're doing and how the tools you're using work, you're just flailing until you maybe eventually get lucky enough for this to work. Stop guessing, figure out what these tools do, look around and read documentation, then figure out what you're trying to accomplish and how you could go about splitting up the work in a reasonable fashion.

Here's my best guess, but I really have very little information to work from so it may not be helpful at all:

  • Your Python script has no concept that it's being run multiple times by Slurm (the -n 16 you refer to, I guess). It makes sense, then, that the job gets repeated 16 times, because Slurm runs the entire script 16 times, and each time your Python script does the entire task from start to finish. If you want Slurm and your Python program to interact, so that the Python program expects to get run multiple times in parallel, I have no idea how to help you there; you'll just need to read more into Slurm.
  • Your data must be readable incrementally, or in parts, if you have any hope of breaking this job into pieces. That is, if you can only read the entire matrix all at once, or not at all, you're stuck with solutions that begin by reading the entire matrix into memory, which you indicate is not really an option. Assuming you can, and that you want to perform some work on each row independently, then you're fortunate enough for your task to be what's officially known as "embarrassingly parallel". This is a very good thing, if true.
  • Assuming your problem is embarrassingly parallel (since it looks like you're just trying to load each row of your data matrix, plot it somehow, then save off that plot as an image to disk), and you can load your data incrementally, then continue reading up on Python's multiprocessing module, and Pool().map is probably the right direction to be headed in. Create some Python generator that produces rows of your data matrix, then pass that generator and func_sq to pool.map, and sit back and wait for the job to finish.
  • If you really need to do this work across multiple machines, rather than hacking your own Slurm + Multiprocessing stack, I'd suggest you start using actual data processing tools, such as PySpark.
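If the problem really is embarrassingly parallel, the generator-plus-pool pattern from the bullets above can be sketched like this (illustrative only: rows() is a hypothetical placeholder for whatever incremental reader your data supports, and func_sq stands in for the plot-and-save work):

```python
from multiprocessing import Pool

def func_sq(row):
    # Placeholder for the per-row work (e.g. plotting the row
    # and saving the figure to disk).
    return sum(v * v for v in row)

def rows():
    # Generator yielding one row at a time, so the full matrix
    # never has to sit in the parent process's memory at once.
    for i in range(4):
        yield [i, i + 1, i + 2]

if __name__ == "__main__":
    with Pool(4) as pool:
        # imap consumes the generator lazily; plain map would first
        # convert the whole iterable to a list.
        results = list(pool.imap(func_sq, rows(), chunksize=1))
    print(results)  # [5, 14, 29, 50]
```

The important design choice is imap over map: it keeps only a small window of rows in flight, which matters when the matrix is too large to hold in memory.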

This doesn't sound like a trivial problem, and even if it were, you don't give sufficient details for me to provide a robust answer. There's no "just fix this one line" answer to what you've asked, but I hope this helps give you an idea of what your tools are doing and how to proceed from here.

Upvotes: -2
