godfryd
godfryd

Reputation: 582

from_tensor_slices() with big numpy array while using tf.keras

I have some training data in a numpy array - it fits in the memory but it is bigger than 2GB. I'm using tf.keras and the dataset API. To give you a simplified, self-contained example:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(32,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])

model.compile(optimizer=tf.train.AdamOptimizer(0.001),
          loss='mse',
          metrics=['mae'])

# generate some big input datasets, bigger than 2GB
data = np.random.random((1024*1024*8, 32))
labels = np.random.random((1024*1024*8, 1))
val_data = np.random.random((100, 32))
val_labels = np.random.random((100, 1))

train_dataset = tf.data.Dataset.from_tensor_slices((data, labels))
train_dataset = train_dataset.batch(32).repeat()

val_dataset = tf.data.Dataset.from_tensor_slices((val_data, val_labels))
val_dataset = val_dataset.batch(32).repeat()

model.fit(train_dataset, epochs=10, steps_per_epoch=30,
      validation_data=val_dataset, validation_steps=3)

So, executing this results in an error "Cannot create a tensor proto whose content is larger than 2GB". The documentation lists a solution to this problem: https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays - just use tf.placeholders and then feed_dict in session run.

Now the main question is: how to do this with tf.keras? I cannot feed anything for the placeholders when I call model.fit() and in fact when I introduced the placeholders I got errors saying "You must feed a value for placeholder tensor".

Upvotes: 3

Views: 2711

Answers (1)

Sharky
Sharky

Reputation: 4533

As with Estimator API, you can use from_generator

data_chunks = list(np.split(data, 1024))
labels_chunks = list(np.split(labels, 1024))

def genenerator():
    for i, j in zip(data_chunks, labels_chunks):
        yield i, j

train_dataset = tf.data.Dataset.from_generator(genenerator, (tf.float32, tf.float32))
train_dataset = train_dataset.shuffle().batch().repeat()

Also take a look https://github.com/tensorflow/tensorflow/issues/24520

Upvotes: 4

Related Questions