Reputation: 3290
I want to understand the effect of the resize() function on a numpy array vs. an h5py dataset. In my application, I am reading a text file line by line and then, after parsing the data, writing it into an hdf5 file. What would be a good approach to implement this? Should I add each new row to a numpy array and keep resizing (growing the first axis) the numpy array, eventually writing the complete numpy array into an h5py dataset, or should I add each new row directly to the h5py dataset and keep resizing the h5py dataset itself? How does the resize() function affect performance if we resize after each row? Or should I resize only after every 100 or 1000 rows?
There can be around 200,000 lines in each dataset.
Any help is appreciated.
Upvotes: 4
Views: 2633
Reputation: 20339
NumPy arrays are not designed to be resized. It's doable, but wasteful in terms of memory (you need to create a second array larger than the first, then fill it with your data, so for a while you're keeping two arrays around) and of course in terms of time (creating the temporary array). You'd be better off starting with lists (or regular arrays, as suggested by @HYRY), then converting to ndarrays when you have a chunk big enough. The question is: when do you need to do the conversion?
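For what it's worth, here is a minimal sketch of that buffer-then-convert idea applied to the h5py side of the question (the file names, column count, and chunk size are assumptions for illustration):

import numpy as np
import h5py

CHUNK = 1000   # flush the buffer every 1000 parsed rows (an arbitrary choice)
NCOLS = 3      # assumed number of values per line

with h5py.File("out.h5", "w") as f:
    # A resizable dataset: the first axis can grow, the second is fixed.
    dset = f.create_dataset("data", shape=(0, NCOLS),
                            maxshape=(None, NCOLS), dtype="f8")
    buf = []
    with open("input.txt") as src:
        for line in src:
            buf.append([float(x) for x in line.split()])
            if len(buf) == CHUNK:
                arr = np.asarray(buf)          # one list-to-array conversion per chunk
                dset.resize(dset.shape[0] + len(arr), axis=0)
                dset[-len(arr):] = arr
                buf = []
    if buf:                                    # flush the remainder
        arr = np.asarray(buf)
        dset.resize(dset.shape[0] + len(arr), axis=0)
        dset[-len(arr):] = arr

This way resize() is called on the h5py dataset once per 1000 rows instead of once per row.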
Upvotes: 2
Reputation: 97261
I think resize() will copy all the data in the array, so it's slow if you call it repeatedly.
If you want to append data to the array continuously, you can create a large array first and use indexing to copy data into it.
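A minimal sketch of that preallocate-and-index idea (the initial size and doubling strategy are just one reasonable choice):

import numpy as np

a = np.empty((1000, 3))            # preallocate more rows than needed so far
n = 0                              # number of rows actually filled
for row in ([0, 1, 2], [3, 4, 5]):
    if n == len(a):                # buffer full: grow geometrically
        a = np.concatenate([a, np.empty_like(a)])
    a[n] = row
    n += 1
b = a[:n]                          # view of just the filled part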
Or you can use the array object from the array module; it's a dynamic array that behaves like a list. After appending all the data to the array object, you can convert it to an ndarray. Here is an example:
import array
import numpy as np

a = array.array("d")   # dynamic array of C doubles
a.extend([0, 1, 2])    # appending is cheap, like a list
a.extend([3, 4, 5])
# frombuffer shares the array's memory, so convert only after all appends
b = np.frombuffer(a, dtype=np.float64).reshape(-1, 3)
Upvotes: 3