Reputation: 55
I have trained a TensorFlow seq2seq model for 30 epochs, saving a checkpoint after each epoch. What I want to do now is combine the best X of those checkpoints (based on results on a development set). Specifically, I'm looking for a way to average the weights of different models and merge them into a new model that can be used for decoding. However, there does not seem to be an established way to do this, and loading multiple models can be a bit tricky. And even if the loading succeeds, I can't find a good answer on how to combine the weights into a new model.
Any help would be greatly appreciated.
Related questions (that do not sufficiently answer in my opinion):
Building multiple models in the same graph
How to load several identical models from save files into one session in Tensorflow
How to create ensemble in tensorflow?
Upvotes: 0
Views: 1053
Reputation: 2670
First, a bit of terminology:
In ensembles (as I understand them) you keep N models at test time and combine their predictions, either by voting or, better, by combining their probability distributions and using the combined distribution as input for further decoding in the case of autoregressive seq2seq decoders (see the sketch below). You can have independent ensembles (each model trained from scratch with a different random initialization) or checkpoint ensembles (taking the last N checkpoints, or the N checkpoints with the best validation score). See e.g. Sennrich et al., 2017 for a comparison of these two types of ensembles.
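For concreteness, a minimal sketch of the decode-time combination step. This assumes each model exposes its next-token distribution as a numpy array; the function name and shapes here are my own, not from any particular library:

```python
import numpy as np

def ensemble_next_token_probs(per_model_probs):
    """Combine next-token distributions from N models at one decoding step.

    per_model_probs: list of N arrays of shape (vocab_size,), each summing to 1.
    Returns the averaged distribution, which the greedy/beam-search decoder
    then uses in place of a single model's output.
    """
    return np.mean(np.stack(per_model_probs), axis=0)
```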
In averaging you average the weights of N models, so at test time you have just one averaged model. This usually gives worse results than a real ensemble, but it is much faster, so you can afford a higher N. If the models were trained completely independently with different random initializations, averaging does not work at all. However, if the models share a reasonable number of initial training steps, averaging may work. A special case is checkpoint averaging, where the last N checkpoints of a single run are averaged, but you can also try "forking" the training and averaging the resulting "semi-independent" models (in addition to checkpoint averaging). It may be very useful to use a constant or cyclical learningning rate; see Izmailov et al., 2018.
As for your question of how to actually average TensorFlow checkpoints: see avg_checkpoints.py or t2t-avg-all.
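For orientation, here is a condensed sketch of what those scripts do. It assumes TF 1.x-style checkpoints, and the checkpoint paths are placeholders; the real scripts handle more edge cases, so treat this only as an outline:

```python
import numpy as np
import tensorflow as tf  # TF 1.x API (use tf.compat.v1 under TF 2.x)

checkpoints = ["model.ckpt-28", "model.ckpt-29", "model.ckpt-30"]  # your best N

# Sum every variable across all checkpoints, remembering each dtype.
var_names = [name for name, _ in tf.train.list_variables(checkpoints[0])]
sums, dtypes = {}, {}
for path in checkpoints:
    reader = tf.train.load_checkpoint(path)
    for name in var_names:
        tensor = reader.get_tensor(name)
        dtypes[name] = tensor.dtype
        sums[name] = sums.get(name, 0) + tensor

# Divide by N and cast back to the original dtype (integer variables such as
# global_step are simply truncated here and may deserve special handling).
avg = {name: (sums[name] / len(checkpoints)).astype(dtypes[name])
       for name in var_names}

# Write the averaged weights out as a new checkpoint for decoding.
tf_vars = {name: tf.Variable(avg[name], name=name) for name in var_names}
saver = tf.train.Saver(tf_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "model.ckpt-averaged")
```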
Upvotes: 2
Reputation: 2878
Averaging the weights of several models to produce a new one is unlikely to give a useful result.
For a simple example, think of a classic CNN like AlexNet. Its first layer contains a series of 2D filters looking for different image features. For every model you train from scratch, similar features are likely to show up in the filters, but the order in which they occur will be very different, so naively averaging the weights will destroy most of the information (see the toy illustration below).
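A toy numpy illustration of the permutation problem: two "first layers" that contain the same two (made-up) filters, just in swapped order. Element-wise averaging produces filters that match neither original:

```python
import numpy as np

edge = np.array([[1., -1.], [1., -1.]])  # stand-in for a learned edge detector
blob = np.array([[1.,  1.], [1.,  1.]])  # stand-in for a learned blob detector

model_a = np.stack([edge, blob])  # filter bank of model A
model_b = np.stack([blob, edge])  # same filters, permuted, in model B

print((model_a + model_b) / 2)
# Both averaged filters come out as [[1, 0], [1, 0]]: the edge structure is
# smeared away, so the averaged layer detects neither feature cleanly.
```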
Upvotes: -1