Reputation: 3
I am using Google Colab to train a 3D autoencoder. I successfully trained the model using the model.fit function; model.summary() reports:
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 128, 128, 128, 1) 0
_________________________________________________________________
conv3d_1 (Conv3D) (None, 64, 64, 64, 64) 1792
_________________________________________________________________
conv3d_2 (Conv3D) (None, 32, 32, 32, 128) 221312
_________________________________________________________________
conv3d_3 (Conv3D) (None, 16, 16, 16, 256) 884992
_________________________________________________________________
conv3d_4 (Conv3D) (None, 8, 8, 8, 256) 1769728
_________________________________________________________________
conv3d_5 (Conv3D) (None, 8, 8, 8, 256) 1769728
_________________________________________________________________
up_sampling3d_1 (UpSampling3 (None, 16, 16, 16, 256) 0
_________________________________________________________________
conv3d_6 (Conv3D) (None, 16, 16, 16, 256) 1769728
_________________________________________________________________
up_sampling3d_2 (UpSampling3 (None, 32, 32, 32, 256) 0
_________________________________________________________________
conv3d_7 (Conv3D) (None, 32, 32, 32, 128) 884864
_________________________________________________________________
up_sampling3d_3 (UpSampling3 (None, 64, 64, 64, 128) 0
_________________________________________________________________
conv3d_8 (Conv3D) (None, 64, 64, 64, 64) 221248
_________________________________________________________________
up_sampling3d_4 (UpSampling3 (None, 128, 128, 128, 64) 0
_________________________________________________________________
conv3d_9 (Conv3D) (None, 128, 128, 128, 1) 1729
=================================================================
Total params: 7,525,121
Trainable params: 7,525,121
Non-trainable params: 0
Training succeeds, and I saved the model as model.h5. I then ran a separate cell within the same project to test the model with the following code:
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras import backend as K

K.clear_session()
model = load_model('model.h5')

x_test = np.load('test.npy')
# x_train is still in memory from the training cell
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = np.reshape(x_train, (len(x_train), 128, 128, 128, 1))
x_test = np.reshape(x_test, (len(x_test), 128, 128, 128, 1))

decoded_imgs = model.predict(x_test)
It throws the following error:
ResourceExhaustedError: OOM when allocating tensor with shape [25,128,128,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node model_1/up_sampling3d_4/concat_1 (defined at :47)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_predict_function_63155]
Function call stack: predict_function
Why is it that I can train the model on this system but cannot run model.predict? Does anyone have an answer, please? :(
I am using Google Colab Pro with the following GPU specs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 62C P0 48W / 250W | 15559MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Upvotes: 0
Views: 687
Reputation: 1275
TensorFlow errors can be really painful to parse. I see them all the time, and I still can't say exactly what this one means, but I have a couple of suggestions. First, check whether it really is a resource error by feeding just a single example to predict:
x_test_single = np.reshape(x_test[0], (1, 128, 128, 128, 1))
model.predict(x_test_single)
Second, if you want to evaluate an entire test set, you'd typically use model.evaluate; see if that works. Lastly, if you really are hitting a resource limit (3D volumes are very memory hungry, so the whole test set may not fit in GPU memory at once), use the tf.data API to build a dataset and feed the model in batches.
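The batching idea can be sketched without TensorFlow; predict_in_batches and the identity "model" below are hypothetical stand-ins for chunked calls to model.predict (in Keras, passing a small batch_size to model.predict has a similar effect):

```python
import numpy as np

def predict_in_batches(predict_fn, x, batch_size=4):
    """Run predict_fn over x in small chunks so only one chunk's
    activations need to fit in memory at a time."""
    outputs = []
    for start in range(0, len(x), batch_size):
        outputs.append(predict_fn(x[start:start + batch_size]))
    return np.concatenate(outputs, axis=0)

# Hypothetical stand-in for model.predict (an identity "model"):
identity = lambda batch: batch

x_test = np.random.rand(10, 8, 8, 8, 1).astype('float32')
decoded = predict_in_batches(identity, x_test, batch_size=4)
print(decoded.shape)  # (10, 8, 8, 8, 1)
```

Only one chunk of intermediate activations lives on the device at a time, which is exactly what caps the peak memory during prediction.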
One more suggestion: I find it better to save in the TensorFlow SavedModel format: https://www.tensorflow.org/tutorials/keras/save_and_load#savedmodel_format
Upvotes: 0