Reputation: 63
I was trying to train a small network to get familiar with TensorFlow 2.0, but it seems that TensorFlow does not work properly on my computer. Here is my code:
import tensorflow as tf
from functools import reduce
from tensorflow.keras import layers, Sequential, datasets
import numpy as np

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

class Model():
    weights = []
    biases = []

    def weights_collect(self):
        for l in self.layers:
            try:
                self.weights.append(l.kernel)
                self.biases.append(l.bias)
            except:
                pass

    def __init__(self):
        self.layers = [
            layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
            layers.MaxPool2D((2, 2)),
            layers.Conv2D(64, (3, 3), activation='relu'),
            layers.MaxPool2D((2, 2)),
            layers.Conv2D(64, (3, 3), activation='relu'),
            layers.MaxPool2D((2, 2)),
            layers.Flatten(),
            layers.Dense(1024, activation='relu'),
            layers.Dense(10)
        ]
        self.model = Sequential(self.layers)
        self.weights_collect()

    @tf.function
    def predict_logits(self, X):
        return self.model(X)

    @tf.function
    def __call__(self, X):
        return tf.nn.softmax(self.model(X))

@tf.function
def loss(m: Model, x: np.ndarray, t: np.ndarray):
    logits = m.predict_logits(x)
    tar = tf.one_hot(t, 10)
    return tf.math.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(tar, logits))

@tf.function
def acc(m, X, T):
    logits = m.predict_logits(X)
    target = tf.reshape(tf.dtypes.cast(T, tf.dtypes.int32), [-1])
    pred = tf.math.argmax(logits, axis=1, output_type=tf.dtypes.int32)
    return tf.math.reduce_sum(tf.dtypes.cast(pred == target, tf.dtypes.int32)) / tf.shape(X)[0]

BATCH_SIZE = 1
dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).batch(BATCH_SIZE)
m = Model()
opt = tf.keras.optimizers.Adam()
EPOCHS = 20

for i in range(EPOCHS):
    for x, t in dataset:
        with tf.GradientTape() as tape:
            loss_value = loss(m, x, t)
        grads = tape.gradient(loss_value, m.model.trainable_weights)
        opt.apply_gradients(zip(grads, m.model.trainable_weights))
        print("hello")
    print(float(loss(m, test_images, test_labels)))
    print(float(acc(m, test_images, test_labels)))
When running this code, I kept getting this kind of error message:
Allocation of 1228800000 exceeds 10% of system memory.
My model then stops training.
I have tried changing the batch size, but that does not help: the model trains for a few iterations and then dies, even with the batch size set to 1.
It seems that TensorFlow keeps allocating system memory during training without ever releasing it.
I have even reinstalled the entire system trying to fix this, and it still does not work.
Upvotes: 2
Views: 998
Reputation: 67
I was suffering from the same issue for many weeks and finally figured it out.
I was running on Windows 10 and Windows Server. Using PowerShell, I ran

.\nvidia-smi -q -i 0 -d SUPPORTED_CLOCKS

and saw that the NVIDIA driver was using CUDA 10.2.
When I downgraded my NVIDIA driver to one using CUDA 10.1 or CUDA 10.0, it finally worked.
There seems to be an issue with the drivers that support CUDA 10.2.
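As a small extra check (a sketch of my own, not something the nvidia-smi output tells you), you can also confirm from inside Python which TensorFlow version is installed and whether it sees the GPU at all; tf.config.experimental.list_physical_devices is the TF 2.0-era spelling of that API:

# Quick sanity check of the TensorFlow/GPU setup.
# The experimental namespace is the TF 2.0-era API; newer 2.x releases
# also offer tf.config.list_physical_devices.
import tensorflow as tf

print(tf.__version__)
print(tf.config.experimental.list_physical_devices('GPU'))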
Upvotes: 0
Reputation: 1
I also had a problem training a Mask R-CNN model using the TensorFlow Object Detection API (it would just hang after the first step, with no GPU usage and a constant ~18% CPU). The script had worked fine a fortnight ago.
Long story short, I rolled back to the suggested driver and it all works well again. I also tried the latest update (the one after 440.97), and the problem was still there.
Upvotes: 0
Reputation: 6034
Which GPU are you using in your computer? Your code works fine on a machine with a GPU card. I increased the batch size to 128, removed the print("hello")
statement, and was then able to run the code successfully.
Also, since you are just getting started with TensorFlow, GradientTape
is not the way to begin learning it; it is meant only for customized training loops. Use model.fit()
to perform training, or else start with these official tutorials.
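For reference, a minimal sketch of what the model.fit() route could look like with the same layer stack and the CIFAR-10 data from the question. The batch size of 128 is the one mentioned above; SparseCategoricalCrossentropy(from_logits=True) is my assumption here, since the labels are integer class indices and the last Dense layer outputs raw logits:

# Same architecture as in the question, trained with compile()/fit()
# instead of a manual GradientTape loop.
import tensorflow as tf
from tensorflow.keras import layers, Sequential, datasets

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

model = Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPool2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPool2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPool2D((2, 2)),
    layers.Flatten(),
    layers.Dense(1024, activation='relu'),
    layers.Dense(10)
])

# Labels are integer class indices (0-9), so use the sparse loss;
# from_logits=True because the final Dense layer has no softmax.
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(train_images, train_labels,
          batch_size=128, epochs=20,
          validation_data=(test_images, test_labels))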
Upvotes: 1
Reputation: 63
Just got the problem solved: it seems there is something wrong with the latest version of NVIDIA's Game Ready Driver (440.97). As soon as I rolled back to 436.48, the code kept training, even though the error message I mentioned above still shows up.
Upvotes: 2