Reputation: 1065
I have a jobqueue configuration for Slurm which looks something like:
cluster = SLURMCluster(cores=20,
                       processes=2,
                       memory='62GB',
                       walltime='12:00:00',
                       interface='ipogif0',
                       log_directory='logs',
                       python='srun -n 1 -c 20 python',
                       )
When increasing the number of processes, each worker gets a smaller allocation of memory. At the start of my workflow, the tasks are highly parallelised and light on memory use. However, the end of the workflow currently runs in serial and requires more memory. Unless I set processes to a small value (e.g. 2 or 3), a worker will 'run out' of memory and dask will restart it (which starts an infinite loop). There's more than enough memory on a single node to run the job, and I'd like to make efficient use of each node (minimising the total requested).
Is it possible to reconfigure the cluster so that the memory available to each worker is larger later in the workflow?
Upvotes: 1
Views: 121
Reputation: 16551
Unfortunately, it is not easy to change a worker's resources on the fly. There are several workarounds discussed on GitHub: link1 and link2.
However, the simplest solution is to close the existing cluster and start a new one with different parameters. It's possible to do this in a loop until all tasks are completed, increasing the resources on each iteration, though this might not work well if your queue times are substantial. Some rough pseudocode:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

starting_memory = 10  # GB requested for the first attempt
while num_tasks_remaining > 0:
    starting_memory += 5  # request more memory for each new cluster
    params_dict = {'memory': f'{starting_memory}GB'}
    with SLURMCluster(**params_dict) as cluster, Client(cluster) as client:
        # some code that uses the client and updates num_tasks_remaining
        ...
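Alternatively, if you already know which part of the workflow is memory-hungry, you can skip the loop entirely: use one cluster with many processes for the wide, low-memory phase, then a second cluster with processes=1 so each worker sees the whole node's memory for the serial tail. A minimal sketch of that idea, reusing the job parameters from the question; run_parallel_phase and run_serial_phase are hypothetical wrappers around your own task code:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Phase 1: many small workers for the highly parallel, low-memory tasks.
with SLURMCluster(cores=20, processes=10, memory='62GB',
                  walltime='12:00:00', interface='ipogif0') as cluster, \
        Client(cluster) as client:
    cluster.scale(jobs=4)                      # e.g. four nodes of small workers
    intermediate = run_parallel_phase(client)  # hypothetical: must return/persist concrete results

# Phase 2: one process per job, so a single worker gets the full 62GB
# for the serial, memory-heavy end of the workflow.
with SLURMCluster(cores=20, processes=1, memory='62GB',
                  walltime='12:00:00', interface='ipogif0') as cluster, \
        Client(cluster) as client:
    cluster.scale(jobs=1)
    result = run_serial_phase(client, intermediate)

The important detail is that whatever the first phase produces must be materialised (gathered locally or written to disk) before its cluster is closed, since any futures become invalid once those workers go away.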
Upvotes: 1