I ran into a problem with AutoKeras while running an example from the book. The task was to generate an architecture for a model trained on the MNIST dataset (the "hello world" task for AutoKeras). I also had issues using my laptop GPU, so I had to add some extra code to enable explicit GPU usage.
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.python.keras.utils.data_utils import Sequence
import autokeras as ak
###### My special code here ##############
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)
##########################################
(x_train, y_train), (x_test, y_test) = mnist.load_data()
clf = ak.ImageClassifier(
    overwrite=True,
    max_trials=10)
##########################################
with tf.device('/gpu:0'):
##########################################
    clf.fit(x_train, y_train, epochs=2)
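As an aside, I am not sure whether the tf.compat.v1 Session above is the right way to get allow_growth applied under TF 2.x eager training; my understanding (possibly wrong) is that the TF2-native way to request on-demand GPU memory growth is roughly this sketch:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Allocate GPU memory on demand instead of reserving it all up front.
    # Must be called before the GPU is initialized, i.e. before building any model.
    tf.config.experimental.set_memory_growth(gpus[0], True)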
Output (epochs set to 2 to get results faster):
Trial 1 Complete [00h 00m 20s]
val_loss: 0.058981552720069885
Best val_loss So Far: 0.058981552720069885
Total elapsed time: 00h 00m 20s
Search: Running Trial #2
Hyperparameter |Value |Best Value So Far
image_block_1/block_type|resnet |vanilla
image_block_1/normalize|True |True
image_block_1/augment|True |False
image_block_1/image_augmentation_1/horizontal_flip|True |None
image_block_1/image_augmentation_1/vertical_flip|False |None
image_block_1/image_augmentation_1/contrast_factor|0.0 |None
image_block_1/image_augmentation_1/rotation_factor|0.0 |None
image_block_1/image_augmentation_1/translation_factor|0.1 |None
image_block_1/image_augmentation_1/zoom_factor|0.0 |None
image_block_1/res_net_block_1/pretrained|True |None
image_block_1/res_net_block_1/version|resnet50 |None
image_block_1/res_net_block_1/trainable|True |None
image_block_1/res_net_block_1/imagenet_size|True |None
classification_head_1/spatial_reduction_1/reduction_type|global_avg|flatten
classification_head_1/dropout|0 |0.5
optimizer |adam |adam
learning_rate |1e-05 |0.001
Epoch 1/2
2/1500 [..............................] - ETA: 5:31 - loss: 2.4616 - accuracy: 0.1562WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.1631s vs `on_train_batch_end` time: 0.2793s). Check your callbacks.
3/1500 [..............................] - ETA: 7:11 - loss: 2.4400 - accuracy: 0.1667
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
<ipython-input-6-fc43cdbb1604> in <module>
1 with tf.device('/gpu:0'):
----> 2 clf.fit(x_train, y_train, epochs=2)
~/anaconda3/envs/ML/lib/python3.8/site-packages/autokeras/tasks/image.py in fit(self, x, y, epochs, callbacks, validation_split, validation_data, **kwargs)
152 **kwargs: Any arguments supported by keras.Model.fit.
153 """
--> 154 super().fit(
155 x=x,
156 y=y,
~/anaconda3/envs/ML/lib/python3.8/site-packages/autokeras/auto_model.py in fit(self, x, y, batch_size, epochs, callbacks, validation_split, validation_data, **kwargs)
277 )
278
--> 279 self.tuner.search(
280 x=dataset,
281 epochs=epochs,
~/anaconda3/envs/ML/lib/python3.8/site-packages/autokeras/engine/tuner.py in search(self, epochs, callbacks, fit_on_val_data, **fit_kwargs)
136 self.oracle.update_space(hp)
137
--> 138 super().search(epochs=epochs, callbacks=new_callbacks, **fit_kwargs)
139
140 # Train the best model use validation data.
~/anaconda3/envs/ML/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py in search(self, *fit_args, **fit_kwargs)
129
130 self.on_trial_begin(trial)
--> 131 self.run_trial(trial, *fit_args, **fit_kwargs)
132 self.on_trial_end(trial)
133 self.on_search_end()
~/anaconda3/envs/ML/lib/python3.8/site-packages/kerastuner/engine/tuner.py in run_trial(self, trial, *fit_args, **fit_kwargs)
151 self._on_train_begin(model, trial.hyperparameters,
152 *fit_args, **copied_fit_kwargs)
--> 153 model.fit(*fit_args, **copied_fit_kwargs)
154
155 def _on_train_begin(model, hp, *fit_args, **fit_kwargs):
~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
106 def _method_wrapper(self, *args, **kwargs):
107 if not self._in_multi_worker_mode(): # pylint: disable=protected-access
--> 108 return method(self, *args, **kwargs)
109
110 # Running inside `run_distribute_coordinator` already.
~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1096 batch_size=batch_size):
1097 callbacks.on_train_batch_begin(step)
-> 1098 tmp_logs = train_function(iterator)
1099 if data_handler.should_sync:
1100 context.async_wait()
~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
778 else:
779 compiler = "nonXla"
--> 780 result = self._call(*args, **kwds)
781
782 new_tracing_count = self._get_tracing_count()
~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
805 # In this case we have created variables on the first call, so we run the
806 # defunned version which is guaranteed to never create variables.
--> 807 return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
808 elif self._stateful_fn is not None:
809 # Release the lock early so that multiple threads can perform the call
~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/function.py in __call__(self, *args, **kwargs)
2827 with self._lock:
2828 graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2829 return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
2830
2831 @property
~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/function.py in _filtered_call(self, args, kwargs, cancellation_manager)
1841 `args` and `kwargs`.
1842 """
-> 1843 return self._call_flat(
1844 [t for t in nest.flatten((args, kwargs), expand_composites=True)
1845 if isinstance(t, (ops.Tensor,
~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1921 and executing_eagerly):
1922 # No tape is watching; skip to running the function.
-> 1923 return self._build_call_outputs(self._inference_function.call(
1924 ctx, args, cancellation_manager=cancellation_manager))
1925 forward_backward = self._select_forward_and_backward_functions(
~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager)
543 with _InterpolateFunctionError(self):
544 if cancellation_manager is None:
--> 545 outputs = execute.execute(
546 str(self.signature.name),
547 num_outputs=self._num_outputs,
~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
57 try:
58 ctx.ensure_initialized()
---> 59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
60 inputs, attrs, num_outputs)
61 except core._NotOkStatusException as e:
ResourceExhaustedError: OOM when allocating tensor with shape[65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node functional_1/global_average_pooling2d/Mean (defined at /home/biowar/anaconda3/envs/ML/lib/python3.8/site-packages/kerastuner/engine/tuner.py:153) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_train_function_37301]
Function call stack:
train_function
Output of nvidia-smi (during Trial 1):
Every 0,5s: nvidia-smi Nitro5: Sun Aug 30 12:59:30 2020
Sun Aug 30 12:59:31 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 165... Off | 00000000:01:00.0 Off | N/A |
| N/A 47C P0 32W / N/A | 1101MiB / 3911MiB | 41% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1691 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2362 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 7530 C ...conda3/envs/ML/bin/python 251MiB |
| 0 N/A N/A 37376 C ...conda3/envs/ML/bin/python 837MiB |
+-----------------------------------------------------------------------------+
Output of nvidia-smi (after Trial 2 started):
Every 0,5s: nvidia-smi Nitro5: Sun Aug 30 12:58:02 2020
Sun Aug 30 12:58:02 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 165... Off | 00000000:01:00.0 Off | N/A |
| N/A 41C P8 1W / N/A | 3885MiB / 3911MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1691 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2362 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 7530 C ...conda3/envs/ML/bin/python 251MiB |
| 0 N/A N/A 35239 C ...conda3/envs/ML/bin/python 3621MiB |
+-----------------------------------------------------------------------------+
How can I modify my code to prevent TensorFlow from consuming essentially 100% of my GPU memory after Trial 1 completes successfully?
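For concreteness, this is the kind of change I am asking about: a minimal, untested sketch (the memory_limit and batch_size values below are my own guesses) that caps how much GPU memory TensorFlow may allocate and lowers the batch size per trial:
import tensorflow as tf
from tensorflow.keras.datasets import mnist
import autokeras as ak

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Hard-cap TensorFlow at ~3 GB of the 4 GB card (the value is a guess).
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3072)])

(x_train, y_train), (x_test, y_test) = mnist.load_data()
clf = ak.ImageClassifier(overwrite=True, max_trials=10)
# batch_size is forwarded to keras.Model.fit (see the auto_model.py signature in the
# traceback above); a smaller batch should lower peak memory for the ResNet trial.
clf.fit(x_train, y_train, epochs=2, batch_size=16)
Would something along these lines be the right direction, or is there a better way?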