Reputation: 36086
What is the most efficient way of incrementally building a numpy array, e.g. one row at a time, without knowing the final size in advance?
My use case is as follows. I need to load a large file (10-100M lines) for which each line requires string processing and should form a row of a numpy array.
Is it better to load the data to a temporary Python list and convert to an array or is there some existing mechanism in numpy that would make it more efficient?
Upvotes: 11
Views: 6171
Reputation: 5408
On my laptop (one core of an Intel i5 at 1.7 GHz):
%%timeit
l = [row]
for i in range(10000):
    l.append(row)
n = np.array(l)
100 loops, best of 3: 5.54 ms per loop
My best attempt with pure numpy (maybe someone knows a better solution):
%%timeit
l = np.empty((int(1e5), row.shape[1]))  # preallocate more rows than needed
for i in range(10000):
    l[i] = row
l = l[np.all(l > 1e-100, axis=1)]  # drop the unused, uninitialized rows by value-filtering
10 loops, best of 3: 18.5 ms per loop
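A variation on that preallocation idea, not part of the timings above, is to track how many rows were actually filled and slice off the unused tail instead of filtering by value; a minimal sketch, with row as a placeholder (1, N) array:

import numpy as np

row = np.random.rand(1, 100)                 # placeholder row; stands in for one parsed line
buf = np.empty((int(1e5), row.shape[1]))     # preallocate more rows than you expect to need

count = 0
for i in range(10000):
    buf[count] = row
    count += 1

data = buf[:count]    # slice off the unused tail rather than filtering by value

Note that buf[:count] is a view into the oversized buffer; call .copy() and drop buf if you want to release the extra memory.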
Upvotes: 3
Reputation: 7036
You should get better performance out of appending each row to a list and then converting to an ndarray afterward.
Here's a test where I append ndarrays to a list 10000 times and then generate a new ndarray when I'm done:
import numpy as np

row = np.random.randint(0, 100, size=(1, 100))
And I time it with ipython notebook:
%%timeit
l = [row]
for i in range(10000):
    l.append(row)
n = np.array(l)
-> 10 loops, best of 3: 132 ms per loop
And here's a test where I concatenate each row:
%%timeit
l = row
for i in range(10000):
    l = np.concatenate((l, row), axis=0)
-> 1 loops, best of 3: 23.1 s per loop
Way slower: each np.concatenate call copies the entire array built so far, so the total work grows quadratically with the number of rows.
The only issue with the first method is that you end up holding both the list and the array in memory at the same time, so you could run into RAM problems on very large inputs. You can avoid that by processing the data in chunks; a sketch of that idea follows.
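A minimal sketch of that chunked approach, assuming a text file parsed line by line; the file name, chunk size, and parse_line helper are placeholders rather than anything from the original answer:

import numpy as np

def parse_line(line):
    # hypothetical string processing: split a line into floats
    return [float(x) for x in line.split()]

def load_in_chunks(path, chunk_size=100000):
    chunks = []    # one ndarray per completed chunk
    buffer = []    # Python list for the current chunk only
    with open(path) as f:
        for line in f:
            buffer.append(parse_line(line))
            if len(buffer) >= chunk_size:
                chunks.append(np.array(buffer))
                buffer = []    # free the list before starting the next chunk
    if buffer:
        chunks.append(np.array(buffer))
    return np.concatenate(chunks, axis=0)

data = load_in_chunks("big_file.txt")    # placeholder file name

This keeps only one chunk's worth of Python list objects alive at a time; the final concatenate still needs the chunk arrays and the result simultaneously, but that is usually far smaller than a full-size Python list of rows.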
Upvotes: 10