Reputation: 2923
I am currently using a supervisor and have constructed just one graph to perform transfer learning using the pre-trained weights from TF-slim. I am wondering if there is a way to restore a checkpoint's variables into multiple inference models at the outset. My primary concern is that, firstly, the name scopes defined as in the reference code on the TF repository may prevent the pre-trained variables from being restored because of a name mismatch. Also, given that the supervisor's init_fn
takes only one saver to restore the variables, how could I use multiple savers to restore the same variables onto multiple GPUs (if I even need multiple savers at all)?
One idea I have is to restore the variables into just one graph and let the other GPUs use that same graph for training, roughly as in the sketch below. However, would training on the next GPU then take place only after the first GPU has finished? And even this way, I still wouldn't be able to restore the weights under the original checkpoint variable names unless I rename the checkpoint weights.
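A minimal sketch of that single-graph idea, assuming a hypothetical two-layer model_fn and two GPUs; the tower layout and scope names are illustrative, not taken from the reference code:

```python
import tensorflow as tf

def model_fn(images):
    # Hypothetical network; in practice this would be the TF-slim model.
    net = tf.layers.conv2d(images, 64, 3, activation=tf.nn.relu, name='conv1')
    net = tf.layers.flatten(net)
    return tf.layers.dense(net, 10, name='logits')

images = tf.placeholder(tf.float32, [None, 28, 28, 1])
labels = tf.placeholder(tf.int64, [None])
split_images = tf.split(images, 2)
split_labels = tf.split(labels, 2)

tower_losses = []
for i in range(2):
    with tf.device('/gpu:%d' % i):
        # reuse=True on the second tower makes both GPUs share the same
        # variable objects, so one restore populates the weights for all towers.
        with tf.variable_scope('model', reuse=(i > 0)):
            logits = model_fn(split_images[i])
            tower_losses.append(
                tf.losses.sparse_softmax_cross_entropy(split_labels[i], logits))

total_loss = tf.add_n(tower_losses) / 2.0
```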
Upvotes: 0
Views: 1422
Reputation: 5206
The TensorFlow documentation on saving and restoring variables points you to the Saver object, which lets you specify which saved variables are restored into which model variables by passing a dictionary mapping saved names to variable objects when constructing the saver.
Upvotes: 1