Reputation: 41
I am currently working in a lab that uses IPython Notebook with Python 2.7 for data processing. We work on pictures taken by a 285*384 pixel camera, with different parameters changing according to what we want to observe. Therefore we need to deal with big matrices, and as the data processing progresses, the accumulation of matrix allocations fills up the RAM / swap and we cannot go any further.
The typical initial data matrix has size 100*285*384*16, 100 being the temporal dimension. We then have to allocate numerous other matrices: the temporal average corresponding to this matrix (of size 285*384*16), then two 100*285*384*16 matrices for the linear fit of the data (2 estimated parameters are needed for the linear fit), then the average and the standard deviation of those fits... and so on. So we allocate a lot of big matrices, which leads to RAM / swap exhaustion. We also display some pictures associated with some of these matrices. Of course we could deallocate matrices as we go further in the data processing, but we need to be able to change the code and see the results of old calculations without having to rerun everything (some calculations are pretty long). All results depend on the previous ones, so we need to keep the data in memory.
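Roughly, a single run allocates something like this (a simplified sketch in plain numpy; the actual code is longer and the fit is more involved):

import numpy as np

# Raw data: 100 time points, 285*384 pixels, 16 parameter values.
# As float64 this is 100*285*384*16*8 bytes, i.e. about 1.4 GB for this array alone.
data = np.zeros((100, 285, 384, 16))

# Temporal average over the 100 time points -> 285*384*16
temporal_avg = data.mean(axis=0)

# Linear fit along the temporal axis (2 estimated parameters per pixel/value)
t = np.arange(100)
slope, intercept = np.polyfit(t, data.reshape(100, -1), 1)

# Average and standard deviation of the fitted parameters
slope_avg = slope.reshape(285, 384, 16).mean(axis=(0, 1))
slope_std = slope.reshape(285, 384, 16).std(axis=(0, 1))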
I would like to know whether there is some way to extend the swap memory (onto the physical storage of a disk, for example) or to bypass our RAM limitations with a smarter way of coding. Otherwise I could use a server of my laboratory institute that has 32 GB of RAM, but it would be a loss of time and convenience for us not to be able to do it on our own computers. The crash occurs on both Macintosh and Windows, and because of the RAM limitations of Python on Windows I will probably try it on Linux, but the 4 GB of RAM of our computers will still be overfilled at some point.
I would really appreciate any help with this problem; I haven't found any answers on the net so far. Thank you in advance for your help.
Upvotes: 2
Views: 505
Reputation: 8124
You can drastically reduce your RAM requirements by storing the images on disk in HDF5 format with compression, using pytables. Depending on your specific data, you can even gain significant performance compared to an all-in-RAM approach.
The trick is to use the blazing fast blosc compression included in pytables.
As an example, this code creates a file containing multiple numpy arrays using blosc compression:
import tables
import numpy as np

img1 = np.arange(200*300*100)
img2 = np.arange(200*300*100)*10

h5file = tables.open_file("image_store.h5", mode="w", title="Example images",
                          filters=tables.Filters(complevel=5, complib='blosc'))

h5file.create_carray('/', 'image1', obj=img1, title='The image number 1')
h5file.create_carray('/', 'image2', obj=img2, title='The image number 2')

h5file.flush()  # This makes sure everything is flushed to disk
h5file.close()  # Closes the file; the previous flush is redundant here
and the following code snippet loads the two arrays back in RAM:
h5file = tables.open_file("image_store.h5") # By default it is a read-only open
img1 = h5file.root.image1[:]      # Load image1 in RAM by using "slicing"
img2 = h5file.root.image2.read()  # Load image2 in RAM by using read()
Finally, if a single array is too big to fit in RAM, you can save and read it chunk-by-chunk using conventional slicing notation. You create a (chunked) pytables array on disk with a preset size and type, and then fill it in chunks in this way:
h5file.create_carray('/', 'image_big', title='Big image',
                     atom=tables.Atom.from_dtype(np.dtype('uint16')),
                     shape=(200, 300, 400))
h5file.root.image_big[:100] = 1
h5file.root.image_big[100:200] = 2
h5file.flush()
Note that this time you don't provide a numpy array to pytables (the obj keyword) but create an empty array instead, so you need to specify its shape and type (the atom keyword).
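Reading the big array back chunk-by-chunk uses the same slicing notation, and only the requested slice is loaded in RAM (a small sketch reusing the image_big array created above):

first_chunk = h5file.root.image_big[:100]      # numpy array of shape (100, 300, 400)
second_chunk = h5file.root.image_big[100:200]  # next 100 planes, loaded separately
h5file.close()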
For more info you can check out the official pytables documentation.
Upvotes: 2