Reputation: 4135
There are many, many questions about this on SO. The answers to all of them appear to be rather straightforward, pointing out that it's almost certainly a memory error and that reducing the batch size should work.
In my case something else appears to be going on (or I have a serious misunderstanding of how this works).
I have a large set of stimuli, like so:
train_x.shape # returns (2352, 131072, 2), i.e. ~2.3k stimuli of size 131072x2
train_y.shape # returns (2352,)
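For scale, a quick back-of-the-envelope calculation (assuming float32 elements; double it for float64, NumPy's default float dtype) shows the full training array alone occupies a couple of gigabytes:

# Rough size of the full train_x array, assuming float32 (4 bytes per element)
n_bytes = 2352 * 131072 * 2 * 4
print(f"{n_bytes / 1024**3:.2f} GiB")  # ~2.30 GiB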
Of course, we can imagine that this might be too much. Indeed, creating a simple model and not setting any batch size results in an InternalError.
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense

model = Sequential([
    Flatten(input_shape=(131072, 2)),
    Dense(128, activation=tf.nn.relu),
    Dense(50, activation=tf.nn.relu),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
model.fit(train_x, train_y, epochs=5)
This returns the following error:
InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
The logical thing to do is to reduce the batch size. However, setting any value from 1 to 2000 simply returns the same error. This appears to imply that I don't have enough memory remaining to load even a single stimulus. However...
If I manually cut up my dataset like so:
# Take the first 20 stimuli
smaller_train_x = train_x[0:20, :, :] # shape is (20, 131072, 2)
smaller_train_y = train_y[0:20] # shape is (20,)
If I try to fit the model to this smaller dataset, it works and does not return an error.
model.fit(smaller_train_x, smaller_train_y, epochs=5)
Thus, with a batch_size of a single stimulus I get a memory error, yet running on a manual cut of 20 stimuli from my dataset works fine.
As I understand it,
# Load in one stimulus at a time
model.fit(train_x, train_y, epochs=5, batch_size=1)
should use roughly 20 times less memory than
# Load in 20 stimuli at a time
model.fit(smaller_train_x, smaller_train_y, epochs=5)
How, then, does the first return a memory error?
I'm running this in a Jupyter notebook with Python 3.8 and TensorFlow 2.10.0.
Upvotes: 0
Views: 604
Reputation: 17201
Based on the following experiments, the size of the training array passed to model.fit(...) also matters, along with the batch_size. This is consistent with the _EagerConst copy in the error message: when NumPy arrays are passed directly to fit, the whole array appears to be converted to a tensor on the GPU up front, so the full array, not just one batch, has to fit in GPU memory.
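For reference, here is a minimal sketch of the memory-printing callback used in the experiments below. It assumes TensorFlow 2.5+ with a visible device named 'GPU:0' and relies on tf.config.experimental.get_memory_info, which reports 'current' and 'peak' usage in bytes; the MemoryPrintingCallback link at the bottom describes the original version.

import tensorflow as tf

class MemoryPrintingCallback(tf.keras.callbacks.Callback):
    # Print current and peak GPU memory usage after every epoch
    def on_epoch_end(self, epoch, logs=None):
        info = tf.config.experimental.get_memory_info('GPU:0')
        gb = 1024 ** 3
        print(f"GPU memory details [current: {info['current'] / gb:.4f} gb, "
              f"peak: {info['peak'] / gb:.4f} gb]")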
train_x: Peak GPU memory increased with batch_size but not linearly
model.fit(train_x, train_y, epochs=1, batch_size=10, callbacks= [MemoryPrintingCallback()])
#GPU memory details [current: 2.7969 gb, peak: 3.0 gb]
model.fit(train_x, train_y, epochs=1, batch_size=100, callbacks= [MemoryPrintingCallback()])
#GPU memory details [current: 2.7969 gb, peak: 3.0 gb]
model.fit(train_x, train_y, epochs=1, batch_size=1000, callbacks= [MemoryPrintingCallback()])
#GPU memory details [current: 2.7969 gb, peak: 4.0 gb]
smaller_train_x: Peak GPU memory is lower than in the previous case for the same batch size
model.fit(smaller_train_x, smaller_train_y, epochs=1, batch_size=10, callbacks= [MemoryPrintingCallback()])
#GPU memory details [current: 0.5 gb, peak: 0.6348 gb]
Converting train_x to TFRecords seems optimal; GPU memory then scales with the batch_size (a sketch of the conversion is below).
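A minimal sketch of that conversion, assuming train_x/train_y are the NumPy arrays above and 'stimuli.tfrecords' is a hypothetical output path (the numpy-to-tfrecords link at the bottom covers this in more detail):

import tensorflow as tf

record_path = 'stimuli.tfrecords'  # hypothetical output file

# Write each (stimulus, label) pair as one serialized tf.train.Example
with tf.io.TFRecordWriter(record_path) as writer:
    for x, y in zip(train_x, train_y):
        example = tf.train.Example(features=tf.train.Features(feature={
            'x': tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[tf.io.serialize_tensor(tf.constant(x, tf.float32)).numpy()])),
            'y': tf.train.Feature(float_list=tf.train.FloatList(value=[float(y)])),
        }))
        writer.write(example.SerializeToString())

# Parse records back into (131072, 2) float tensors and scalar labels
def parse_example(record):
    parsed = tf.io.parse_single_example(record, {
        'x': tf.io.FixedLenFeature([], tf.string),
        'y': tf.io.FixedLenFeature([], tf.float32),
    })
    x = tf.reshape(tf.io.parse_tensor(parsed['x'], tf.float32), (131072, 2))
    return x, parsed['y']

dataset = tf.data.TFRecordDataset(record_path).map(parse_example)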
dataset = dataset.batch(10)
model.fit(dataset, epochs=1,callbacks= [MemoryPrintingCallback()])
#GPU memory details [current: 0.5 gb, peak: 0.6348 gb]
dataset = dataset.batch(100)
model.fit(dataset, epochs=1,callbacks= [MemoryPrintingCallback()])
#GPU memory details [current: 0.5 gb, peak: 0.7228 gb]
dataset = dataset.batch(1000)
model.fit(dataset, epochs=1,callbacks= [MemoryPrintingCallback()])
#GPU memory details [current: 0.5 gb, peak: 1.6026 gb]
MemoryPrintingCallback(): How to print the maximum memory used during Keras's model.fit()
numpy-to-tfrecords: Numpy to TFrecords: Is there a more simple way to handle batch inputs from tfrecords?
Upvotes: 1