Reputation: 3290
I want to understand the effect of the resize() function on a numpy array vs. an h5py dataset. In my application, I am reading a text file line by line and then, after parsing the data, writing it into an hdf5 file. What would be a good approach to implement this? Should I add each new row to a numpy array and keep resizing (growing the first axis) the numpy array, eventually writing the complete numpy array into an h5py dataset, or should I add each new row directly to the h5py dataset and keep resizing the h5py dataset itself? How does the resize() function affect performance if we resize after each row? Or should I resize only after every 100 or 1000 rows?
There can be around 200,000 lines in each dataset.
Any help is appreciated.
Upvotes: 4
Views: 2633
Reputation: 20339
NumPy arrays are not designed to be resized. It's doable, but wasteful in terms of memory (you need to create a second array larger than the first, then fill it with your data, so for a while you're keeping two arrays around) and of course in terms of time (creating the temporary array). You'd be better off starting with lists (or regular arrays, as suggested by @HYRY), then converting to ndarrays when you have a chunk big enough. The question is: when do you need to do the conversion?
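For what it's worth, here is a minimal sketch of that buffer-then-convert idea applied to the h5py side of the question (the file names, column count, and chunk size are assumptions for illustration):

import numpy as np
import h5py

CHUNK = 1000   # flush the buffer every 1000 parsed rows (an arbitrary choice)
NCOLS = 3      # assumed number of values per line

with h5py.File("out.h5", "w") as f:
    # A resizable dataset: the first axis can grow, the second is fixed.
    dset = f.create_dataset("data", shape=(0, NCOLS),
                            maxshape=(None, NCOLS), dtype="f8")
    buf = []
    with open("input.txt") as src:
        for line in src:
            buf.append([float(x) for x in line.split()])
            if len(buf) == CHUNK:
                arr = np.asarray(buf)          # one list-to-array conversion per chunk
                dset.resize(dset.shape[0] + len(arr), axis=0)
                dset[-len(arr):] = arr
                buf = []
    if buf:                                    # flush the remainder
        arr = np.asarray(buf)
        dset.resize(dset.shape[0] + len(arr), axis=0)
        dset[-len(arr):] = arr

This way resize() is called on the h5py dataset once per 1000 rows instead of once per row.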
Upvotes: 2
Reputation: 97261
I think resize() will copy all the data in the array, so it's slow if you call it repeatedly.
If you want to append data to the array continuously, you can create a large array first and use indexing to copy data into it.
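A minimal sketch of that preallocate-and-index idea (the initial size and doubling strategy are just one reasonable choice):

import numpy as np

a = np.empty((1000, 3))            # preallocate more rows than needed so far
n = 0                              # number of rows actually filled
for row in ([0, 1, 2], [3, 4, 5]):
    if n == len(a):                # buffer full: grow geometrically
        a = np.concatenate([a, np.empty_like(a)])
    a[n] = row
    n += 1
b = a[:n]                          # view of just the filled part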
Or you can use the array object from the array module; it's a dynamic array that behaves like a list. After appending all the data to the array object, you can convert it to an ndarray. Here is an example:
import array
import numpy as np

a = array.array("d")   # dynamic array of C doubles
a.extend([0, 1, 2])    # appending is cheap, like a list
a.extend([3, 4, 5])
# frombuffer shares the array's memory, so convert only after all appends
b = np.frombuffer(a, dtype=np.float64).reshape(-1, 3)
Upvotes: 3