Reputation: 11
I'm training my keras dense models on very large datasets.
For practical reasons, I am saving them on disk in separate .txt files. I have 1e4 text files, each containing 1e4 examples.
I would like to find a way to fit my keras model on this dataset as a whole. For now, I am only able to use "model.fit" on individual text files, i.e.:
for k in range(10000):
    X = np.loadtxt('/path/X_'+str(k)+'.txt')
    Y = np.loadtxt('/path/Y_'+str(k)+'.txt')
    mod = model.fit(x=X, y=Y, batch_size=batch_size, epochs=epochs)
This is problematic if, for instance, I want to perform several epochs over the whole dataset.
Ideally, I would like to have a dataloader function that could be used in the following way to feed all the sub-datasets as a single one:
mod = model.fit(dataloader('/path/'), batch_size=batch_size, epochs=epochs)
I think I found what I want, but only for datasets composed of images: tf.keras.preprocessing.image.ImageDataGenerator.flow_from_directory
Is there any tf/keras function that does something similar, but for datasets that are not composed of images?
Thanks!
Upvotes: 1
Views: 317
Reputation: 1508
You can create a generator function and then use the tensorflow Dataset class with its from_generator method to create a dataset; see below a dummy example:
import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset

def mygenerator():
    for k in range(1000):
        # yield arrays whose dtypes match the output_signature below
        x = np.random.normal(size=(1000,)).astype(np.float32)
        y = np.random.randint(low=0, high=5, size=1000).astype(np.int32)
        yield x, y

mydataset = Dataset.from_generator(
    mygenerator,
    output_signature=(tf.TensorSpec(shape=(1000,), dtype=tf.float32),
                      tf.TensorSpec(shape=(1000,), dtype=tf.int32)))
mytraindata = mydataset.batch(batch_size)
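To apply this to your case (one X_k.txt / Y_k.txt pair per index), you can make the generator read the files themselves. Below is a minimal sketch assuming each X file is a 2D array of shape (n_examples, n_features) and each Y file a 1D array of labels; the path pattern, n_files and n_features are placeholders you would replace with your own values:
import numpy as np
import tensorflow as tf

def file_generator(path='/path/', n_files=10000):
    # Read each X_k.txt / Y_k.txt pair and yield one (example, label) at a time
    for k in range(n_files):
        X = np.loadtxt(path + 'X_' + str(k) + '.txt').astype(np.float32)
        Y = np.loadtxt(path + 'Y_' + str(k) + '.txt').astype(np.float32)
        for x, y in zip(X, Y):
            yield x, y

n_features = 100  # placeholder: the number of columns in each X_k.txt

dataset = tf.data.Dataset.from_generator(
    file_generator,
    output_signature=(tf.TensorSpec(shape=(n_features,), dtype=tf.float32),
                      tf.TensorSpec(shape=(), dtype=tf.float32)))

dataset = dataset.batch(batch_size)
mod = model.fit(dataset, epochs=epochs)  # no batch_size here: the dataset is already batched
Note that when you pass a tf.data.Dataset to model.fit you don't pass batch_size (batching is done by the dataset), and each epoch re-runs the generator over all files. Adding .shuffle(...) before .batch(...) can help if you don't want examples served in file order.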
Upvotes: 2