Reputation: 333
I'm trying to run a python script in which part of the code is going to be parallelized according to some SLURM environment variables. I don't think the exact code is important, but for reference, I would like to use this to train my networks.
Now, the problem is that I need to run my script via srun; however, this will spawn multiple parallel instances of my script, which I don't want.
The most basic example would be this:
#!/bin/sh
#SBATCH -N 2
#SBATCH --ntasks=2
srun python myscript.py
Now I will have 2 nodes and 2 tasks, meaning that when I run python myscript.py there will be 2 instances of myscript.py running in parallel.
However, this is not what I want. I would like there to be only one instance of myscript.py running, but it should still have access to the environment variables set by srun, leaving it to the Python script to distribute the resources properly.
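For illustration, a minimal myscript.py (a hypothetical stand-in, not my actual training code) makes the duplication visible:
# myscript.py -- hypothetical minimal example
import os

# srun sets one SLURM_PROCID per task (0 and 1 with --ntasks=2), so this
# line is printed once per parallel instance.
print("task", os.environ.get("SLURM_PROCID"),
      "of", os.environ.get("SLURM_NTASKS"),
      "on node", os.environ.get("SLURMD_NODENAME"))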
Setting srun --ntasks=1 does not work, since then the script will only 'see' one of the nodes.
Is it possible to use srun to run a single instance of the script while it still has access to the SLURM environment variables? I've looked at options such as --exclusive and --preserve-env, but they do not seem to help in this case.
Upvotes: 2
Views: 8295
Reputation: 333
Turns out that Hristo Iliev was right in the comments: to use the SlurmClusterResolver properly, multiple jobs need to run in parallel. This can be a bit confusing, since everything is printed multiple times because everything runs in parallel, but this is normal.
However, my initial confusion, and my assumption that it had to be done as stated in the original question, came from TensorFlow reporting out-of-memory errors whenever I tried to use the MultiWorkerMirroredStrategy, whereas I knew that without it the model fit perfectly within the available memory.
Somewhere in my code I made a call to tf.config.get_visible_devices("GPU"). For TensorFlow to get the GPUs it allocates them, and by default it does so by filling up the complete GPU memory. However, since all scripts run in parallel, each script tries to do this for itself (since this happens outside the scope of the strategy), resulting in out-of-memory (OOM) errors.
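The problematic pattern looked roughly like this (a sketch, not my exact code):
import tensorflow as tf

# Querying the GPUs outside strategy.scope(); in my setup this made each
# parallel task initialize the GPUs for itself and claim their full memory,
# which led to the OOM errors.
gpus = tf.config.get_visible_devices("GPU")
print("Visible GPUs:", gpus)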
After removing this piece of code, everything ran fine.
Suggestions for people who might stumble upon this post in the future:
- Scripts are supposed to run in parallel; you will see the same output multiple times.
- Make sure that everything is done under strategy.scope(), i.e. model compiling and data generation setup (using tf.data).
- Pay special attention to saving the model: only the 'main' worker should save the model to the real save file, the others should write to temporary files (see here).
- If you get out-of-memory errors, make sure there is not some piece of code that allocates all the GPUs outside of the scope. This can be some initialization somewhere by TensorFlow, but if it is present in all scripts it will cause OOM errors. A handy way to test this is to use tf.config.experimental.set_memory_growth to allow memory growth instead of full memory allocation, as sketched below.
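A minimal sketch of enabling memory growth, to be run before anything else touches the GPUs:
import tensorflow as tf

# Enable memory growth on every physical GPU so each parallel task only
# allocates what it actually needs instead of grabbing all GPU memory up
# front; this must happen before the GPUs are initialized.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)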
In my code, I used the get_task_info() function of tf.distribute.cluster_resolver.SlurmClusterResolver, and only ran functions that allocate memory when the task number was 0, i.e. the main worker; a rough sketch is shown below.
(Above functions and comments are based on TensorFlow 2.2.0 and Python 3.7.7)
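A rough sketch of that gating, assuming the resolver can be constructed without arguments (newer TensorFlow versions allow this; older versions may require a jobs mapping):
import tensorflow as tf

# The resolver reads the SLURM environment variables set by srun.
resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
task_type, task_id = resolver.get_task_info()

# Only the main worker (task 0) runs one-off, memory-allocating setup.
if task_id == 0:
    print("Running one-off setup as", task_type, task_id)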
Upvotes: 3