charmander

Reputation: 1085

How to run TensorFlow on an AWS cluster?

I'm trying to run distributed TensorFlow on an EMR/EC2 cluster, but I don't know how to assign parts of the code to specific instances in the cluster.

The documentation uses tf.device("/gpu:0") to select a GPU. But what if I have a master CPU instance and 5 slave GPU instances running in an EMR cluster, and I want specific GPUs to run some of the code? I can't pass the public DNS names of the instances to tf.device() because it throws an error saying the name cannot be resolved.

Upvotes: 17

Views: 2130

Answers (1)

pfm

Reputation: 6328

Since your question, AWS has released some code to ease the use of distributed TensorFlow on an EC2 cluster.

See this GitHub repository. Everything is described in the README.md, but in short, it creates an AWS stack with:

  • Security Groups
  • Elastic File System
  • EC2 instances running the AWS Deep Learning AMI, with the EFS mounted on them
  • Configuration on those EC2 instances that lets you start a distributed TensorFlow run with a command on the master node (see the Running Distributed Training on TensorFlow section)
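
As for the error in the question: distributed TensorFlow does not take hostnames in tf.device(). Hosts are listed once in a cluster specification, and devices are then addressed by job name and task index. Below is a minimal sketch of that naming scheme in plain Python, so it runs without TensorFlow installed. The hostnames, port 2222, the helper names, and the choice to use the master as a "ps" job are all assumptions for illustration, not something taken from the AWS repository.

```python
import json


def build_cluster(master_host, worker_hosts, port=2222):
    """Return a job-name -> ["host:port", ...] mapping.

    This is the dictionary shape accepted by tf.train.ClusterSpec.
    The job names "ps" and "worker" are conventional, not required.
    """
    return {
        "ps": ["%s:%d" % (master_host, port)],
        "worker": ["%s:%d" % (host, port) for host in worker_hosts],
    }


def device_for(job, task, device="GPU:0"):
    """Build a device string to pass to tf.device().

    The task index is the position of the host in the job's list,
    so no DNS name ever appears in the device string.
    """
    return "/job:%s/task:%d/device:%s" % (job, task, device)


def tf_config(cluster, job, task):
    """Serialize the cluster and this node's role as a TF_CONFIG value."""
    return json.dumps({"cluster": cluster, "task": {"type": job, "index": task}})


if __name__ == "__main__":
    # Hypothetical private hostnames; substitute the ones from your cluster.
    workers = ["gpu-node-%d.internal" % i for i in range(5)]
    cluster = build_cluster("master-node.internal", workers)
    print(device_for("worker", 3))        # fourth GPU instance
    print(device_for("ps", 0, "CPU:0"))   # master CPU
    print(tf_config(cluster, "worker", 3))
```

Each instance would then start a TensorFlow server with its own job name and task index, and the device strings above select where each part of the graph runs.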

Upvotes: 1
