Reputation: 1914
I have the opportunity to run my Tensorflow training on a cluster computer with slurm workload manager (the cluster contains nearly 400000 cores, 40000 GB of RAM, Performance is Rmax=500 TFlop/s and Rpeak=1000 TFlop/s, AMD GPUs).
I work on image processing projects using deep learning algorithms.
My question is how to scale my keras deep learning to run on this cluster using slurm as workload manager ?
Upvotes: 2
Views: 1461
Reputation: 21
Use Horovod to scale out the Keras training - https://github.com/uber/horovod
Upvotes: 2