Reputation: 697
Everything was working fine on TensorFlow 1.14. I now have to update it for various reasons, and the training (which I run as Google AI Platform jobs) has degraded dramatically: I now get ResourceExhaustedError for my models, and even when I reduce the batch size substantially to get around this (which I'd rather not do anyway), training slows down by a factor of about 5.
My migration can be summarized as changing my config yaml from:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  runtimeVersion: "1.14"
to
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  runtimeVersion: "2.5"
  pythonVersion: "3.7"
and, of course, updating all the relevant code to be TF 2.x compliant. I also tried fiddling with the scaleTier and masterType settings, to no avail.
My models are Keras-based, involve LSTM layers, and have about 2 million and 5.5 million parameters respectively. A rough sketch of their structure is below.
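For concreteness, the models are built along these lines (layer sizes, input shapes, and dropout values are illustrative, not the exact production configuration):

import tensorflow as tf

# Representative structure only; sizes and shapes are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 128)),  # (timesteps, features)
    tf.keras.layers.LSTM(256, return_sequences=True, recurrent_dropout=0.2),
    tf.keras.layers.LSTM(256, recurrent_dropout=0.2),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")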
What can I do here? Why this extreme degradation in training performance on Google AI Platform when I make this change?
Upvotes: 1
Views: 100
Reputation: 697
It appears the problem was that I was using recurrent_dropout in my LSTM layers, which seemingly is no longer supported for GPU-accelerated training in TensorFlow 2.x (with a non-zero recurrent_dropout the layer falls back to a much slower, more memory-hungry non-cuDNN implementation). After removing that argument from my LSTM layers, the issue disappeared.
Notably, neither the migration instructions nor the tf_upgrade_v2 script helped with this at all.
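For reference, a minimal sketch of the change (layer size is illustrative). Per the TF 2.x Keras docs, the fused cuDNN kernel is only used when recurrent_dropout is 0 and the other LSTM defaults are kept (activation='tanh', recurrent_activation='sigmoid', use_bias=True, unroll=False):

import tensorflow as tf

# Before: non-zero recurrent_dropout forces the slow generic implementation on GPU.
# tf.keras.layers.LSTM(256, return_sequences=True, recurrent_dropout=0.2)

# After: with the defaults, the layer is eligible for the cuDNN kernel.
layer = tf.keras.layers.LSTM(256, return_sequences=True)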
Upvotes: 1