Reputation: 697
Everything was working fine on TensorFlow 1.14. I now have to update it for various reasons, and the training (which I run as Google AI Platform jobs) has degraded dramatically: I now get ResourceExhaustedError for my models, and even when I reduce the batch size substantially to get around this (which I'd rather not do anyway), training slows down by a factor of about 5.
My migration can be summarized as changing my config yaml from:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  runtimeVersion: "1.14"
to
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  runtimeVersion: "2.5"
  pythonVersion: "3.7"
and, of course, updating all the relevant code to be TF 2.x compliant. I also tried fiddling with the scaleTier and masterType settings, to no avail.
My models are Keras-based, involve LSTM layers, and have about 2 million and 5.5 million parameters respectively. A rough sketch of their structure is below.
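For concreteness, the models are built along these lines (layer sizes, input shapes, and dropout values are illustrative, not the exact production configuration):

import tensorflow as tf

# Representative structure only; sizes and shapes are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 128)),  # (timesteps, features)
    tf.keras.layers.LSTM(256, return_sequences=True, recurrent_dropout=0.2),
    tf.keras.layers.LSTM(256, recurrent_dropout=0.2),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")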
What can I do here? Why this extreme degradation in training performance on Google AI Platform when I make this change?
Upvotes: 1
Views: 100
Reputation: 697
It appears the problem was that I was using recurrent_dropout in my LSTM layers, which seemingly is no longer supported for GPU-accelerated training in TensorFlow 2.x (with a non-zero recurrent_dropout the layer falls back to a much slower, more memory-hungry non-cuDNN implementation). After removing that argument from my LSTM layers, the issue disappeared.
Notably, neither the migration instructions nor the tf_upgrade_v2 script helped with this at all.
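For reference, a minimal sketch of the change (layer size is illustrative). Per the TF 2.x Keras docs, the fused cuDNN kernel is only used when recurrent_dropout is 0 and the other LSTM defaults are kept (activation='tanh', recurrent_activation='sigmoid', use_bias=True, unroll=False):

import tensorflow as tf

# Before: non-zero recurrent_dropout forces the slow generic implementation on GPU.
# tf.keras.layers.LSTM(256, return_sequences=True, recurrent_dropout=0.2)

# After: with the defaults, the layer is eligible for the cuDNN kernel.
layer = tf.keras.layers.LSTM(256, return_sequences=True)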
Upvotes: 1