Reputation: 708
I am trying to understand how TensorFlow training depends on the results of previous runs. When we train a model, we specify the learning rate in the optimizer so that the model is trained with minimum cost. The learning rate does not change within a sub-epoch, but it changes when the global step reaches a multiple of sub_epoch.
def Train(...):
    epoch = 5
    sub_epoch = 3
    for i in range(epoch):
        for j in range(sub_epoch):
            session.run(optimizer, ...)
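For context, here is a minimal self-contained sketch of what I mean, assuming the learning rate is fed through a placeholder so the Python loop controls the schedule (the toy model and the decay schedule are just placeholders for illustration, not my actual code):

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API assumed

# Toy linear model so the sketch is self-contained (not the real model).
x = tf.placeholder(tf.float32, shape=[None, 1])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([1, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# Feed the learning rate each step so the outer loop decides its value.
lr = tf.placeholder(tf.float32, shape=[])
train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)

epoch, sub_epoch = 5, 3
lr_schedule = [0.1 * (0.5 ** i) for i in range(epoch)]  # assumed per-epoch decay

data_x = np.random.rand(30, 1).astype(np.float32)
data_y = 2.0 * data_x

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for i in range(epoch):
        for j in range(sub_epoch):
            # learning rate is constant within an epoch, changes between epochs
            session.run(train_op, feed_dict={x: data_x, y: data_y, lr: lr_schedule[i]})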
Because I am not sure how TensorFlow works during training, and whether each run relies on the internal result of previous runs, I am afraid that splitting the training across multiple threads could cause inaccurate training results.
Let's say sub_epoch is 3. Can we train the data for each sub-epoch in 3 different threads with the same learning rate, and wait for all 3 threads to complete before starting the next epoch's training?
Thread 1, epoch 0, sub-epoch 0: Train(data1, lr1)
Thread 2, epoch 0, sub-epoch 1: Train(data2, lr1)
Thread 3, epoch 0, sub-epoch 2: Train(data3, lr1)
[wait for all 3 threads to complete]
Thread 1, epoch 1, sub-epoch 0: Train(data4, lr2)
Thread 2, epoch 1, sub-epoch 1: Train(data5, lr2)
Thread 3, epoch 1, sub-epoch 2: Train(data6, lr2)
...
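For concreteness, the scheduling I have in mind would look roughly like the sketch below (Train, the data chunks, and the lr values are the same names as in the table above):

import threading

def run_epoch(chunks, lr):
    # One thread per sub-epoch, all sharing the same learning rate.
    threads = [threading.Thread(target=Train, args=(chunk, lr)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all 3 threads before the next epoch starts

run_epoch([data1, data2, data3], lr1)  # epoch 0
run_epoch([data4, data5, data6], lr2)  # epoch 1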
I would like to understand the training dependency. Could someone please tell me whether the approach described above is correct?
Because I am not familiar with how this works, I may be asking some silly questions, so please feel free to point out anything described above that is wrong (e.g. if we do not actually need to wait for all 3 threads to complete, etc.).
Upvotes: 1
Views: 91
Reputation: 5808
What you've described is asynchronous training, where variable updates are not coordinated (i.e. each worker/thread fetches whatever values are available for each variable, then sends its updates). In this case there is no equivalent single-threaded sequencing of session.run calls, since the model "snapshots" are inconsistent.
A popular alternative is synchronous training, where each worker gets the same values for each variable. This in effect just gives you a larger batch size. tf.train.SyncReplicasOptimizer is one way to do synchronous training.
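Roughly, a sketch of wrapping an optimizer with it (not a complete distributed setup; loss, is_chief, and the cluster wiring are assumed to exist already):

# Wrap the base optimizer so gradients from all replicas are aggregated
# before a single, consistent variable update is applied.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
opt = tf.train.SyncReplicasOptimizer(opt,
                                     replicas_to_aggregate=3,
                                     total_num_replicas=3)

global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Each worker runs the coordination hook provided by the wrapper.
sync_hook = opt.make_session_run_hook(is_chief)
with tf.train.MonitoredTrainingSession(is_chief=is_chief,
                                       hooks=[sync_hook]) as sess:
    sess.run(train_op)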
Upvotes: 1