Reputation: 199
The official way to execute multiple tf.Session() instances in parallel is to use tf.train.Server, as described in Distributed TensorFlow. On the other hand, the following works for Keras and can presumably be adapted to plain TensorFlow without using tf.train.Server, according to Keras + Tensorflow and Multiprocessing in Python.
import multiprocessing

def _training_worker(train_params):
    # import Keras inside the worker so TensorFlow state is created in the child process
    import keras
    model = obtain_model(train_params)
    model.fit(train_params)
    send_message_to_main_process(...)

def train_new_model(train_params):
    # args must be a tuple, otherwise multiprocessing tries to unpack train_params itself
    training_process = multiprocessing.Process(target=_training_worker, args=(train_params,))
    training_process.start()
    get_message_from_training_process(...)
    training_process.join()
Is the first method faster than the second one? I have code written the second way, and due to the nature of my algorithm (AlphaZero) a single GPU is supposed to run many processes, each of which performs prediction on a tiny minibatch.
Upvotes: 3
Views: 1482
Reputation: 53768
tf.train.Server is designed for distributed computation within a cluster, when there is a need to communicate between different nodes. This is especially useful when training is distributed across multiple machines or, in some cases, across multiple GPUs on a single machine. From the documentation:
An in-process TensorFlow server, for use in distributed training.
A tf.train.Server instance encapsulates a set of devices and a tf.Session target that can participate in distributed training. A server belongs to a cluster (specified by a tf.train.ClusterSpec), and corresponds to a particular task in a named job. The server can communicate with any other server in the same cluster.
Spawning multiple processes with multiprocessing.Process isn't a cluster in the TensorFlow sense, because the child processes don't interact with each other. This method is easier to set up, but it is limited to a single machine. Since you say you have just one machine, this might not be a strong argument, but if you ever plan to scale to a cluster of machines, you'll have to redesign the whole approach.
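One practical note for your shared-GPU setup: by default each child process's TensorFlow session tries to pre-allocate nearly all GPU memory, so many processes on one card need to be told to share. Here is a minimal sketch of the worker, assuming TF 1.x with the Keras TensorFlow backend; obtain_model is the placeholder from your question, and the memory behaviour is just the standard allow_growth option:
import multiprocessing

def _training_worker(train_params):
    # import TF/Keras inside the child so all GPU state belongs to this process
    import tensorflow as tf
    import keras.backend as K

    # let many processes coexist on one GPU: grow memory as needed
    # instead of pre-allocating the whole card
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    K.set_session(tf.Session(config=config))

    model = obtain_model(train_params)   # placeholder from the question
    model.fit(train_params)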
tf.train.Server is thus a more universal and scalable solution. Besides, it allows you to organize complex training with non-trivial communication patterns, e.g., asynchronous gradient updates. Whether it is faster to train or not depends greatly on the task; I don't think there will be a significant difference on one shared GPU.
Just for reference, here's how the code looks with a server (between-graph replication example):
import sys
import tensorflow as tf

# specify the cluster's architecture
cluster = tf.train.ClusterSpec({
    'ps': ['192.168.1.1:1111'],
    'worker': ['192.168.1.2:1111',
               '192.168.1.3:1111']
})

# parse the command line to specify the machine
job_type = sys.argv[1]       # job type: "worker" or "ps"
task_idx = int(sys.argv[2])  # index of the job in the worker or ps list as defined in the ClusterSpec

# create the TensorFlow Server. This is how the machines communicate.
server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx)

# the parameter server is updated by remote clients
# and will not proceed beyond this if statement
if job_type == 'ps':
    server.join()
else:
    # workers only
    with tf.device(tf.train.replica_device_setter(worker_device='/job:worker/task:%d' % task_idx,
                                                  cluster=cluster)):
        # build your model here as if you were only using a single machine
        pass
    with tf.Session(server.target):
        # train your model here
        pass
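To run it, the same script would be launched once per machine listed in the ClusterSpec, with the job type and task index passed on the command line, e.g. (assuming the file is saved as trainer.py, a name used here only for illustration): python trainer.py ps 0 on 192.168.1.1, python trainer.py worker 0 on 192.168.1.2, and python trainer.py worker 1 on 192.168.1.3.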
Upvotes: 5