Reputation: 3355
I'm using Keras to train a model on SageMaker, but I hit this error:
MemoryError: Unable to allocate 381. MiB for an array with shape (25000, 2000) and data type float64
Here's the code:
import pandas as pd
import numpy as np
from keras.datasets import imdb
from keras import models, layers, optimizers, losses, metrics
import matplotlib.pyplot as plt

# load the preprocessed IMDB dataset
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=2000)

# one-hot encode the integer sequences into a binary matrix
def vectorize_sequences(sequences, dimension=2000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
Then I get the error above.
The first time I ran this code it worked, but it fails when I try to re-run it. How can I fix this by freeing memory, or is there a way to make better use of the memory on SageMaker?
Upvotes: 1
Views: 1991
Reputation: 36684
I wouldn't know about SageMaker or AWS specifically, but something you can do is cast your input to float32, which takes half the memory of float64. You can cast the vectorized input like this:
x_train = tf.cast(x_train, tf.float32)
float32 is the default dtype of TensorFlow weights, so you don't need float64 anyway. Proof:
import tensorflow as tf
layer = tf.keras.layers.Dense(8)
print(layer(tf.random.uniform((10, 100), 0, 1)).dtype)
<dtype: 'float32'>
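Along the same lines, if you keep the question's NumPy helper you can allocate the one-hot matrix as float32 up front, so the float64 array is never created at all (roughly 190 MiB instead of 381 MiB). A minimal sketch adapting the question's vectorize_sequences; the only change is the dtype argument:
import numpy as np

def vectorize_sequences(sequences, dimension=2000):
    # allocate as float32 instead of NumPy's default float64: half the memory
    results = np.zeros((len(sequences), dimension), dtype=np.float32)
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results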
My other suggestions are to use fewer words from your dataset, or not to one-hot encode the sequences at all. If you're planning on training a recurrent model with an embedding layer, you won't need to one-hot encode them anyway.
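A minimal sketch of that last suggestion, assuming padded sequences of length 200 and illustrative layer sizes (none of these values come from the original answer):
import numpy as np
from keras.datasets import imdb
from keras import models, layers
from keras.preprocessing.sequence import pad_sequences

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=2000)

# pad/truncate each review to 200 integer word indices instead of one-hot encoding
x_train = pad_sequences(train_data, maxlen=200)
x_test = pad_sequences(test_data, maxlen=200)

model = models.Sequential([
    layers.Embedding(input_dim=2000, output_dim=32),  # learns dense word vectors
    layers.LSTM(32),                                  # recurrent layer over the sequence
    layers.Dense(1, activation="sigmoid"),            # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, train_labels, epochs=3, batch_size=128, validation_split=0.2)
The padded training set here is 25000 x 200 int32 values (about 20 MB), which sidesteps the 381 MiB float64 allocation entirely.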
Upvotes: 4