c0mr4t

Reputation: 361

How do I checkpoint only the best model from a ray tune run?

NOTE: To some extent, this was already asked here, but my question tackles a different aspect of getting the best checkpoint.

In the referenced question, the author only wanted to retrieve the best checkpoint from a set of checkpoints after the Ray Tune run. I want to ensure that only the best checkpoint is saved in the first place. So basically, I am looking for something like:

At the point where the Ray checkpointing callback would be triggered, check whether the current model state is better than the current "best checkpoint". If so, delete the old "best checkpoint" and replace it by checkpointing the current model state. If not, don't trigger the checkpointing callback.

The reason is that I am testing hundreds of large models simultaneously and need to save disk space.

Upvotes: 4

Views: 1533

Answers (1)

c0mr4t

Reputation: 361

I didn't solve the issue because the need went away later on. But for anyone who runs into a similar issue, here is a suggestion that MIGHT work:

You basically have two options: either interfere with Ray Tune's main process, or control the models in its child processes directly. I think messing with Ray Tune's main process is more complicated, so I'd go with the subprocesses.

During training, Ray logs its progress and model results to files. You could check exactly which files Ray writes these model results to. Then remove all checkpointing mechanisms that existed so far in your project and introduce a custom checkpoint callback in the training function of your model. This custom callback checks the model result files, and ONLY if the model actually performed best is it checkpointed to a central folder in your project (possibly overwriting a previous best).
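A minimal sketch of that idea, using only the standard library. It ASSUMES that every trial directory inside the experiment folder contains a `result.json` file with one JSON object per line; check what your Ray version actually writes before relying on this. The names `best_reported_score`, `checkpoint_if_best`, `save_fn` and `best.ckpt` are made up for illustration and are not part of Ray.

```python
import json
import os
import tempfile
from pathlib import Path


def best_reported_score(experiment_dir, metric):
    """Return the highest value of `metric` found in any trial's result file.

    ASSUMPTION: each trial directory holds a line-delimited `result.json`.
    """
    best = float("-inf")
    for result_file in Path(experiment_dir).glob("*/result.json"):
        for line in result_file.read_text().splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # empty or partially written line; skip it
            value = record.get(metric) if isinstance(record, dict) else None
            if isinstance(value, (int, float)):
                best = max(best, value)
    return best


def checkpoint_if_best(score, experiment_dir, metric, best_dir, save_fn):
    """Call save_fn(path) only if `score` is at least as high as every logged result.

    `save_fn` is a hypothetical callable that writes the model to `path`.
    Returns True if a checkpoint was written, False otherwise.
    """
    if score < best_reported_score(experiment_dir, metric):
        return False
    best_dir = Path(best_dir)
    best_dir.mkdir(parents=True, exist_ok=True)
    # Save to a temporary file first and swap it in afterwards, so a crash
    # never leaves a half-written "best" checkpoint behind.
    fd, tmp_path = tempfile.mkstemp(dir=best_dir, suffix=".tmp")
    os.close(fd)
    save_fn(tmp_path)
    os.replace(tmp_path, best_dir / "best.ckpt")  # overwrites the previous best
    return True
```

Inside the training function you would call something like `checkpoint_if_best(val_acc, experiment_dir, "accuracy", "best_model", lambda p: torch.save(model.state_dict(), p))` after each evaluation. This assumes a higher metric is better, and nothing here prevents two trials from passing the check at the same moment, so treat it as a starting point.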

Issues you might run into:

  1. How can a subprocess identify itself? If Ray Tune says "model 3 is currently best", how does the subprocess know that it is model 3?

    I am sure there are multiple ways to deal with this (the most obvious way to differentiate between models would be the Ray Tune params that are set in the models).

  2. How can you be sure that the model result files are always up to date?

    If files are not flushed properly, you might end up with only the second- or third-best model. I don't think that really matters with hundreds of models, but if you want the absolute best, it is something you should be aware of.
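For the first issue, matching on the params could look roughly like the sketch below. It ASSUMES that each trial directory contains a `params.json` file holding that trial's config as JSON; verify this for your setup. `find_own_trial_dir` is a made-up helper name, not a Ray function.

```python
import json
from pathlib import Path


def find_own_trial_dir(experiment_dir, config):
    """Return the trial directory whose params.json equals `config`, or None.

    ASSUMPTION: each trial directory holds a `params.json` with the trial's
    config, and `config` itself is JSON-serializable.
    """
    # Round-trip through JSON so that tuples and similar values compare the
    # same way they would after being written to disk.
    wanted = json.loads(json.dumps(config))
    for params_file in Path(experiment_dir).glob("*/params.json"):
        try:
            stored = json.loads(params_file.read_text())
        except json.JSONDecodeError:
            continue  # file is still being written; skip it
        if stored == wanted:
            return params_file.parent
    return None
```

If two trials happen to sample identical params, this returns whichever directory is found first, so it only identifies a trial when the configs are unique.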

Upvotes: 1
