J Tovell

Reputation: 53

Saving large numpy 2d arrays

I have an array with ~1,000,000 rows, each of which is a numpy array of 4,800 float32 numbers. I need to save this as a CSV file, but numpy.savetxt has been running for 30 minutes and I don't know how much longer it will take. Is there a faster method of saving the large array as a CSV? Many thanks, Josh

Upvotes: 2

Views: 466

Answers (1)

Mad Physicist

Reputation: 114230

As pointed out in the comments, 1e6 rows * 4800 columns * 4 bytes per float32 is ~18 GiB. Writing a float to text takes ~9 bytes per value (estimating 1 for the integer part, 1 for the decimal point, 5 for the mantissa and 2 for the separator), which comes out to ~40 GiB. This takes a long time, since just the conversion to text itself is non-trivial, and disk I/O will be a huge bottleneck.
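For reference, here is that arithmetic as a quick sanity check in plain Python (no assumptions beyond the figures above):

```python
rows, cols = 1_000_000, 4_800

binary_bytes = rows * cols * 4  # 4 bytes per float32
text_bytes = rows * cols * 9    # ~9 characters per value as CSV text

print(f"binary: {binary_bytes / 2**30:.1f} GiB")  # -> binary: 17.9 GiB
print(f"text:   {text_bytes / 2**30:.1f} GiB")    # -> text:   40.2 GiB
```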

One way to optimize this process may be to convert the entire array to text on your own terms and write it out in blocks using Python's binary I/O, as sketched below. I doubt that will give you much benefit, though.
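A minimal sketch of that idea, using a small stand-in for your array; the chunk size is an arbitrary guess you would tune for your machine:

```python
import numpy as np

arr = np.random.rand(100, 48).astype(np.float32)  # small stand-in for your array
chunk = 10_000  # rows per block -- an assumption, tune to taste

with open('data.csv', 'wb') as f:  # binary mode: we hand it pre-encoded bytes
    for start in range(0, arr.shape[0], chunk):
        block = arr[start:start + chunk]
        # Format a whole block to text in one pass, then write one big buffer
        text = '\n'.join(','.join(f'{x:.5g}' for x in row) for row in block)
        f.write(text.encode('ascii') + b'\n')
```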

A much better solution would be to write the binary data to a file instead of text. Aside from the obvious advantages of space and speed, binary has the advantage of being searchable and of not requiring transformation after loading. You know where every individual element is in the file, so if you are clever, you can access portions of the file without loading the entire thing. Finally, a binary file is more likely to be highly compressible than a relatively low-entropy text file.
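A sketch of the two obvious binary routes, again with a small stand-in array; `tofile` dumps raw bytes, while `np.save` uses the .npy format, which records dtype and shape for you:

```python
import numpy as np

arr = np.random.rand(100, 48).astype(np.float32)  # stand-in for the real data

# Option 1: raw bytes, no header -- you must remember dtype and shape yourself
arr.tofile('data.bin')
same = np.fromfile('data.bin', dtype=np.float32).reshape(arr.shape)

# Option 2: the .npy format, which stores dtype and shape in a small header
np.save('data.npy', arr)
restored = np.load('data.npy')
```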

Disadvantages of binary are that it is not human-readable and not as portable as text. The latter is not a problem, since transforming into an acceptable format will be trivial. The former is likely a non-issue given the amount of data you are attempting to process anyway.

Keep in mind that human readability is a relative term. No human can read 40 GiB of numerical data with understanding. A human can either A) process a graphical representation of the data, or B) scan through relatively small portions of it. Both cases are well served by a binary representation. Case A) is straightforward: load, transform and plot the data. This will be much faster if the data is already in a binary format that you can pass directly to the analysis and plotting routines. Case B) can be handled with something like a memory-mapped file, as sketched below. You only ever need to load a small portion of the file, since you can't really show more than, say, a thousand elements on screen at one time anyway. Any reasonably modern platform should be able to keep up with the I/O and binary-to-text conversion associated with a user scrolling around a table widget or similar. In fact, binary makes this easier, since you know exactly where each element belongs in the file.
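A minimal sketch of case B) using numpy's built-in memory mapping, assuming the data was saved to `data.npy` as above:

```python
import numpy as np

# Map the file without reading it all into RAM; only the pages
# you actually touch are loaded from disk.
arr = np.load('data.npy', mmap_mode='r')

# Pull a small window from anywhere in the array -- roughly what a
# table widget would need as the user scrolls:
window = arr[40:65, 10:20]
print(window)
```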

Upvotes: 2
