JavNoor

Reputation: 402

Save the exact state of a TensorFlow model, random state, and Datasets API pointer for debugging

TLDR: Is there a way to freeze a TensorFlow model at runtime at time t1, so that running the network from time 0 to t2 > t1 gives exactly the same results as restoring it and running from t1 to t2?

I have searched for this quite a lot and couldn't find this exact scenario:

I have a TensorFlow model that receives its inputs through the Datasets API from a list of TFRecords. At seemingly random moments I get a tensor shape incompatibility error, and I'm trying to figure out why. I have fixed the seeds so that the run is reproducible, but it takes about 30 minutes for the error to occur. What is the best strategy for debugging faster in a situation like this?

So far I have tried saving a checkpoint at every iteration, hoping that restoring the last one (right before the error) would let me reproduce the error quickly and troubleshoot it. Unfortunately, the random state and the Datasets API pointer get reset when I do this. Is there any way to store the full state of a network at runtime (including its random number generator state and the Datasets API pointer), so that restoring it reproduces the same outputs?

Upvotes: 0

Views: 234

Answers (1)

Vlad

Reputation: 8585

From my personal experience I would approach it in the following ways.

  1. Running the code with the -i flag (python -i), which drops you into the interpreter with the state preserved at the moment the script stops. Or, even better, calling the problematic parts of the code from a Jupyter notebook, which also preserves the state after the exception is raised, so you can investigate the problem more easily. If the problem is inside a function, you could catch the exception and return all relevant objects. You could also put your functions inside a class so that everything lives in a single object: instantiate and run it from Jupyter, and when the problem occurs you will have all the variables on that object.

  2. Adding assert statements for the shapes of your data and of your model variables/placeholders. For example, if you have some preprocessing/augmentation, add asserts before and after it to be sure that the shapes are as expected.

  3. Taking a break. Sometimes you spend a lot of time and effort on something without success, but after having a rest you solve the problem immediately.
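To make items 1 and 2 more concrete, here is a minimal, framework-agnostic sketch in plain Python. It is only an illustration under assumptions, not code from your project: the names shape_of, assert_shape, DebugRun and drop_last_column are made up, and nested lists stand in for real batches. With TensorFlow you would presumably compare something like tensor.shape.as_list() against the expected shape instead of calling shape_of.

```python
def shape_of(nested):
    """Shape of a rectangular nested list (hypothetical stand-in for `.shape`)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


def assert_shape(name, actual, expected):
    """Fail loudly if `actual` differs from `expected`; None matches any size."""
    actual = tuple(actual)
    matches = len(actual) == len(expected) and all(
        e is None or a == e for a, e in zip(actual, expected)
    )
    assert matches, f"{name}: expected shape {expected}, got {actual}"


class DebugRun:
    """Keeps the objects of the most recent step on a single instance."""

    def __init__(self, preprocess, expected_in, expected_out):
        self.preprocess = preprocess
        self.expected_in = expected_in
        self.expected_out = expected_out
        self.step = None    # index of the last batch that was attempted
        self.batch = None   # the batch itself, kept for inspection
        self.output = None  # preprocessing result, None if it never ran
        self.error = None   # the caught exception, if any

    def run(self, batches):
        """Return True if every batch passed, False as soon as one fails."""
        for step, batch in enumerate(batches):
            self.step, self.batch, self.output = step, batch, None
            try:
                # Assert before and after preprocessing, as in item 2.
                assert_shape("input", shape_of(batch), self.expected_in)
                self.output = self.preprocess(batch)
                assert_shape("output", shape_of(self.output), self.expected_out)
            except Exception as exc:
                # Catch the exception and keep the relevant objects, as in item 1.
                self.error = exc
                return False
        return True


def drop_last_column(batch):
    """Toy preprocessing step used only for this demo."""
    return [row[:-1] for row in batch]


batches = [
    [[1, 2, 3], [4, 5, 6]],
    [[1, 2, 3]],
    [[1, 2], [3, 4]],  # deliberately malformed: 2 columns instead of 3
]
run = DebugRun(drop_last_column, expected_in=(None, 3), expected_out=(None, 2))
ok = run.run(batches)
print(ok, run.step, run.error)
```

If you run this from Jupyter or with python -i, the run object stays around after the failure, so run.batch, run.step and run.error can be inspected directly. Keep in mind that assert statements are skipped when Python is started with -O.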

Good luck!

Upvotes: 1
