Reputation: 1023
I am writing a few functions which walk down a directory tree, sort the files in each directory by name, and encode them, together with some additional information, into a table.
I define this table as a NumPy array built with vstack. In the beginning, adding entries to the array is lightning fast, but once it reaches around 20 000 entries it slows down considerably: so much so that it takes around 10 minutes to reach the target of around 90 000 populated rows.
I strongly suspect that vstack is the culprit, as it may be copying the whole table plus the row I am appending every time. The official NumPy documentation says that vstacking is nothing but "concatenation"... but this does not answer my question.
Hence: does np.vstack() look at the sizes of the arrays it is going to glue together, allocate the needed memory, and then copy the contents of the arrays being stacked into it?
Update: Just for the statistics, ladies and gentlemen: accumulating the rows in a plain Python list brought the execution time down to 0.5 s. That is more than 20 times faster, and in reality the list-based time is even lower than that, because my measurement includes some extra operations.
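A minimal sketch of that list-based pattern, in case it helps someone else (the root path and the per-file encoding below are hypothetical placeholders, not my actual code):

import os
import numpy as np

rows = []  # plain Python list; appending is amortized O(1), no copying of old rows
for dirpath, _dirnames, filenames in os.walk("/some/root"):  # hypothetical root
    for name in sorted(filenames):  # sort the files in each directory by name
        full_path = os.path.join(dirpath, name)
        # hypothetical per-file encoding: name length and file size
        rows.append([len(name), os.path.getsize(full_path)])

# build the NumPy table once at the end, so the data is copied only once
table = np.vstack(rows)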
Upvotes: 5
Views: 2164
Reputation: 1598
You are right, np.vstack copies the full arrays. You can do a small Python experiment to confirm it:
>>> import numpy as np
>>> a = np.array([[1, 2, 3]])
>>> b = np.array([[4, 5, 6]])
>>> res = np.vstack((a, b))
>>> res
array([[1, 2, 3],
       [4, 5, 6]])
Then if you modify the array a and print res, you can see that res is not modified:
>>> a[0, 2] = 19
>>> res
array([[1, 2, 3],
       [4, 5, 6]])
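If you prefer a check that does not mutate anything, np.shares_memory (available in NumPy 1.11+) reports directly whether two arrays overlap in memory; continuing the session above, both results are False because res is a fresh copy:

>>> np.shares_memory(a, res)
False
>>> np.shares_memory(b, res)
False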
Upvotes: 8