Reputation: 1051
I am using TensorFlow V1.7 with the new high-level Estimator interface. I was able to create and train my own network with my own dataset.
However, the policy I use to load images just doesn't seem right to me. The approach I have used so far (largely inspired by the MNIST tutorial) is to load all images into memory from the beginning (here is a tiny code snippet to give you an idea):
import os
import random

import cv2

def load_dataset(folder):
    images, labels = [], []
    for filename in os.listdir(folder):
        filepath = os.path.join(folder, filename)
        # using OpenCV to read the image as grayscale
        images.append(cv2.imread(filepath, cv2.IMREAD_GRAYSCALE))
        labels.append(<corresponding label>)
    # shuffle samples and labels in the same way
    temp = list(zip(images, labels))
    random.shuffle(temp)
    images, labels = zip(*temp)
    return images, labels
This means that I have to load my entire training set, something like 32k images, into memory before training the net. However, since my batch size is 100, the net never needs more than 100 images at a time.
This approach seems quite weird to me. I understand that this way secondary storage is only accessed once, maximizing performance; however, if my dataset were really big, this could overload my RAM, couldn't it?
As a consequence, I would like to use a lazy approach, loading images only when they are needed (i.e. when they happen to be in a batch). How can I do this? I have searched the TF documentation, but so far I have not found anything.
Is there something I'm missing?
Upvotes: 2
Views: 2535
Reputation: 1130
It's advised to use the tf.data Dataset module, which (among other things) lets you build input pipelines with queues, prefetch a small number of examples into memory, control the number of loading threads, and much more.
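As a rough sketch of what this could look like for your case (names like `load_input_fn`, the buffer sizes, and the assumption of equally sized grayscale PNGs are mine, not from your code), you keep only the file paths and labels in memory, and let the `Dataset` decode each image lazily when its batch is requested:

```python
import tensorflow as tf

def load_input_fn(filenames, labels, batch_size=100):
    """Build an input_fn that loads images lazily, batch by batch.

    `filenames` and `labels` are plain Python lists, built e.g. by the
    directory-walking loop from the question (but WITHOUT calling
    cv2.imread -- only the paths are stored).
    """
    def input_fn():
        # Only the path strings and labels live in memory here.
        dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))

        def _parse(filename, label):
            # Read and decode ONE image, only when its batch is needed.
            raw = tf.read_file(filename)
            # Assumes grayscale PNGs of identical size; adjust the decoder
            # (and add tf.image.resize_images) for your actual data.
            image = tf.image.decode_png(raw, channels=1)
            return image, label

        dataset = dataset.shuffle(buffer_size=1000)   # shuffles paths, not pixels
        dataset = dataset.map(_parse, num_parallel_calls=4)
        dataset = dataset.batch(batch_size)
        dataset = dataset.prefetch(1)                 # overlap loading with training
        return dataset
    return input_fn
```

In TF 1.7 an Estimator's `input_fn` may return the `Dataset` directly, so training would look something like `estimator.train(input_fn=load_input_fn(filenames, labels))`. At any moment only the shuffle buffer plus a prefetched batch or two are held in RAM, regardless of how large the dataset on disk is.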
Upvotes: 2