How to save big array so that it will take less memory in python?

Question

I am new to python. I have a big array, a, with dimensions such as (43200, 4000) and I need to save this, as I need it for future processing. when I try to save it with a np.savetxt, the txt file is too large and my program runs into memory error as I need to process 5 files of same size. Is there any way to save huge arrays so that it will take less memory?

Thanks.

ely · Accepted Answer

You can use PyTables to create a Hierarchical Data Format (HDF) file to store the data. This provides some interesting in-memory options that link the object you're working with to the file it's saved in.

Here is another StackOverflow questions that demonstrates how to do this: "How to store a NumPy multidimensional array in PyTables."

If you are willing to work with your array as a Pandas DataFrame object, you can also use the Pandas interface to PyTables / HDF5, e.g.:

import pandas
import numpy as np
a = np.ones((43200, 4000)) # Not recommended.
x = pandas.HDFStore("some_file.hdf")

x.append("a", pandas.DataFrame(a)) # <-- This will take a while.
x.close()

# Then later on...
my_data = pandas.HDFStore("some_file.hdf") # might also take a while
usable_a_copy = my_data["a"] # Be careful of the way changes to
                             # `usable_a_copy` affect the saved data.

copy_as_nparray = usable_a_copy.values

With files of this size, you might consider whether your application can be performed with a parallel algorithm and potentially applied to only subsets of the large arrays rather than needing to consume all of the array before proceeding.

How to save big array so that it will take less memory in python?

Answers (2)

Related Questions