chesschi

Reputation: 708

Tensorflow: Does the training depend on the result of previous runs

I am trying to understand how TensorFlow training depends on the results of previous runs. When we train a model, we specify a learning rate in the optimizer to minimize the cost. The learning rate does not change within a sub-epoch, but it changes whenever the global step reaches a multiple of sub_epoch.

def Train(...):
    epoch = 5
    sub_epoch = 3
    for i in range(epoch):
        # the learning rate is updated once per epoch,
        # i.e. every `sub_epoch` global steps
        for j in range(sub_epoch):
            session.run(optimizer, ...)

Because I am not sure how TensorFlow works during training, and whether each run relies on the internal results of previous runs, I am afraid that splitting the training across multiple threads could produce inaccurate results.

Let's say sub_epoch is 3. Can we train each sub-epoch in 3 different threads with the same learning rate, and wait for all 3 threads to complete before starting the next epoch's training?

Thread 1, epoch 0, sub-epoch 0: Train(data1, lr1)
Thread 2, epoch 0, sub-epoch 1: Train(data2, lr1)
Thread 3, epoch 0, sub-epoch 2: Train(data3, lr1)
[wait for all 3 threads to complete]
Thread 1, epoch 1, sub-epoch 0: Train(data4, lr2)
Thread 2, epoch 1, sub-epoch 1: Train(data5, lr2)
Thread 3, epoch 1, sub-epoch 2: Train(data6, lr2)
...
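
To make the idea concrete, here is a rough sketch of what I have in mind (TF 1.x graph mode; train_op, x_placeholder, lr_placeholder, and epoch_batches are just placeholders for my real graph and input data, and I am relying on tf.Session.run being safe to call from multiple threads):

import threading
import tensorflow as tf

# Hypothetical names: train_op, x_placeholder, lr_placeholder and
# epoch_batches stand in for my actual graph and input pipeline.
def train_sub_epoch(session, batch, lr):
    session.run(train_op, feed_dict={x_placeholder: batch,
                                     lr_placeholder: lr})

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for batches, lr in epoch_batches:          # one (batches, lr) pair per epoch
        threads = [threading.Thread(target=train_sub_epoch,
                                    args=(session, batch, lr))
                   for batch in batches]       # one thread per sub-epoch
        for t in threads:
            t.start()
        for t in threads:
            t.join()                           # wait before starting the next epoch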

I would like to understand the training dependency. Could someone tell me which of the descriptions below is correct?

  1. The training depends on the results of previous runs regardless of the learning rate
  2. The training depends on the results of previous epochs, which use a different learning rate (i.e. the scenario described above)
  3. The training does not depend on the previous results at all
  4. Others -- please explain

Because I am not familiar with how this works, I may be asking silly questions. Please feel free to tell me if anything described above is wrong (e.g. that we do not need to wait for all 3 threads to complete, etc.)

Upvotes: 1

Views: 91

Answers (1)

Allen Lavoie

Reputation: 5808

What you've described is asynchronous training, where variable updates are not coordinated (i.e. each worker/thread fetches whatever values are available for each variable, then sends its updates). In this case there is no equivalent single-threaded sequencing of session.run calls, since the model "snapshots" are inconsistent.

A popular alternative is synchronous training, where each worker gets the same values for each variable. This in effect just gives you a larger batch size. tf.train.SyncReplicasOptimizer is one way to do synchronous training.
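
For example, a minimal sketch of wrapping an optimizer with tf.train.SyncReplicasOptimizer might look like the following (TF 1.x graph mode; loss and learning_rate come from your own model, and a real deployment also needs a cluster spec with one worker process per replica):

# Sketch only: loss and learning_rate are assumed to come from your model.
global_step = tf.train.get_or_create_global_step()

base_opt = tf.train.GradientDescentOptimizer(learning_rate)
sync_opt = tf.train.SyncReplicasOptimizer(
    base_opt,
    replicas_to_aggregate=3,   # aggregate gradients from 3 replicas per step
    total_num_replicas=3)
train_op = sync_opt.minimize(loss, global_step=global_step)

# The hook manages the queue that collects and averages replica gradients.
sync_hook = sync_opt.make_session_run_hook(is_chief=True)
with tf.train.MonitoredTrainingSession(hooks=[sync_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op)

Each worker would run the same train_op; only the chief passes is_chief=True.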

Upvotes: 1
