Reputation: 467
I have the following problem: I have many files of 3D volumes that I open to extract a bunch of numpy arrays. I want to access those arrays in random order, which means that in the worst case I open as many 3D volumes as there are arrays I want to get, if every array lives in a separate file. The IO here isn't great: I open a big file only to read a small numpy array from it. Any idea how I can store all these arrays so that the IO is better? I can't pre-read all the arrays and save them to a single file, because that file would then be too big to load into RAM.
I looked into LMDB, but everything I found about it seems to be tied to Caffe. Any idea how I can achieve this?
Upvotes: 0
Views: 1303
Reputation: 467
I iterated through my dataset, created an HDF5 file, and stored each element in it. It turns out that when the HDF5 file is opened, it doesn't load all the data into RAM; it only loads the header. The header is then used to fetch the data on request, which is how I solved my problem.
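For illustration, here is a minimal sketch of this approach using h5py (the library isn't named above; the file name, dataset keys, and array shapes are placeholders):

```python
import h5py
import numpy as np

# Stand-in for the real extraction step: yields the small arrays that would
# normally come from the big 3D volume files (shapes are illustrative).
def extract_arrays():
    for _ in range(100):
        yield np.random.rand(32, 32, 32).astype(np.float32)

# Write once: iterate over the dataset and store each array under its own key.
with h5py.File("arrays.h5", "w") as f:
    for i, arr in enumerate(extract_arrays()):
        f.create_dataset(f"arr_{i}", data=arr, compression="gzip")

# Read: opening the file only loads the metadata, not the arrays themselves.
with h5py.File("arrays.h5", "r") as f:
    keys = list(f.keys())
    idx = np.random.randint(len(keys))
    small_array = f[keys[idx]][()]   # only this one dataset is read from disk
    print(small_array.shape)
```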
Reference: http://www.machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html
Upvotes: 1
Reputation: 804
One simple solution is to pre-process your dataset and save multiple smaller crops of the original 3D volumes as separate files. This way you trade some disk space for more efficient IO.
Note that you can tune the crop size here: saving crops that are bigger than your network input still lets you do random-crop augmentation on the fly. If you save overlapping crops in the pre-processing step, you can ensure that every possible random crop of the original dataset can still be produced (see the sketch below).
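A pre-processing sketch along these lines (the function name, crop size, stride, and .npy output format are illustrative assumptions, not part of the answer):

```python
import numpy as np

def save_overlapping_crops(volume, crop_size, stride, out_prefix):
    """Cut a 3D volume into overlapping cubic crops and save each to its own .npy file."""
    D, H, W = volume.shape
    idx = 0
    for z in range(0, max(D - crop_size + 1, 1), stride):
        for y in range(0, max(H - crop_size + 1, 1), stride):
            for x in range(0, max(W - crop_size + 1, 1), stride):
                crop = volume[z:z + crop_size, y:y + crop_size, x:x + crop_size]
                np.save(f"{out_prefix}_crop{idx:05d}.npy", crop)
                idx += 1

# Example: for a 32^3 network input, saving 48^3 crops with stride 16
# (= crop_size - input_size) keeps every possible 32^3 random crop of this
# volume fully contained in at least one saved crop.
volume = np.random.rand(128, 128, 128).astype(np.float32)  # stand-in for a real volume
save_overlapping_crops(volume, crop_size=48, stride=16, out_prefix="vol0")
```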
Alternatively, you may try a custom data loader that keeps each full volume in memory for a few batches. Be careful, though: this can introduce correlation between batches. Since many machine learning algorithms rely on i.i.d. samples (e.g. Stochastic Gradient Descent), correlated batches can easily cause serious problems. A rough sketch of such a loader follows.
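A minimal sketch of such a loader, assuming the volumes are stored as .npy files (the file format, function name, and parameters are assumptions for illustration):

```python
import numpy as np

def cached_volume_batches(volume_paths, crops_per_volume, crop_size, batch_size, rng=None):
    """Yield batches of random crops, reusing each opened volume for several
    batches before loading the next one. This reduces file opens at the cost
    of some correlation between consecutive batches."""
    rng = rng or np.random.default_rng()
    while True:
        path = volume_paths[rng.integers(len(volume_paths))]
        volume = np.load(path)                              # one expensive open ...
        D, H, W = volume.shape
        for _ in range(crops_per_volume // batch_size):     # ... reused for several batches
            batch = []
            for _ in range(batch_size):
                z = rng.integers(0, D - crop_size + 1)
                y = rng.integers(0, H - crop_size + 1)
                x = rng.integers(0, W - crop_size + 1)
                batch.append(volume[z:z + crop_size, y:y + crop_size, x:x + crop_size])
            yield np.stack(batch)
```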
Upvotes: 0