Reputation: 742
I have two NumPy arrays. The first, Z1, is about 300,000 rows long and 3 columns wide. The second, Z2, is about 200,000 rows long and 300 columns wide. Each row of both Z1 and Z2 has a 10-digit identifying number. Z2 contains a subset of the items in Z1, and I want to match the rows in Z2 with their partners in Z1 based on the 10-digit identifying number, then take columns 2 and 3 from Z1 and append them to the end of the matching rows of Z2.
Neither Z1 nor Z2 is in any particular order.
The only way I've come up with to do this is by iterating over the arrays, which takes hours. Is there a better way to do this in Python?
Thanks!
Upvotes: 4
Views: 4670
Reputation: 67427
I understand from your question that the 10-digit identifier is stored in the first column (index 0 in the code below), right?
This is not very easy to follow, since there is a lot of indirection going on, but in the end unsorted_insert holds, for each identifier in Z2, the row number in Z1 where that identifier is found:
sort_idx = np.argsort(Z1[:, 0])
sorted_insert = np.searchsorted(Z1[:, 0], Z2[:, 0], sorter=sort_idx)
# The following is equivalent to unsorted_insert = sort_idx[sorted_insert] but faster
unsorted_insert = np.take(sort_idx, sorted_insert)
So now all we need to do is fetch the last two columns of those rows and stack them onto the Z2 array:
new_Z2 = np.hstack((Z2, Z1[unsorted_insert, 1:]))
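To make the indirection concrete, here is a minimal sketch of the same steps on tiny hand-made arrays. The values are hypothetical and only for illustration; column 0 is assumed to hold the identifier, as above.

```python
import numpy as np

# Hypothetical toy data: column 0 is the identifier.
Z1 = np.array([[30, 1, 2],
               [10, 3, 4],
               [20, 5, 6]])
Z2 = np.array([[20, 7],
               [30, 8]])

# Row order that would sort Z1 by identifier: ids 10, 20, 30 live in rows 1, 2, 0.
sort_idx = np.argsort(Z1[:, 0])                      # [1, 2, 0]

# Position of each Z2 identifier within the *sorted* view of Z1's identifiers.
sorted_insert = np.searchsorted(Z1[:, 0], Z2[:, 0], sorter=sort_idx)  # [1, 2]

# Map those sorted positions back to actual row numbers of Z1.
unsorted_insert = sort_idx[sorted_insert]            # [2, 0]

# Append columns 2 and 3 of the matching Z1 rows to Z2.
new_Z2 = np.hstack((Z2, Z1[unsorted_insert, 1:]))
print(new_Z2)
# [[20  7  5  6]
#  [30  8  1  2]]
```

Row 0 of Z2 (identifier 20) picks up the values 5 and 6 from row 2 of Z1, and row 1 (identifier 30) picks up 1 and 2 from row 0 of Z1.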
A made-up example that runs with no issues:
import numpy as np
z1_rows, z1_cols = 300000, 3
z2_rows, z2_cols = 200000, 300
z1 = np.arange(z1_rows*z1_cols).reshape(z1_rows, z1_cols)
z2 = np.random.randint(10000, size=(z2_rows, z2_cols))
z2[:, 0] = z1[np.random.randint(z1_rows, size=(z2_rows,)), 0]
sort_idx = np.argsort(z1[:, 0])
sorted_insert = np.searchsorted(z1[:, 0], z2[:, 0], sorter=sort_idx)
# The following is equivalent to unsorted_insert = sort_idx[sorted_insert] but faster
unsorted_insert = np.take(sort_idx, sorted_insert)
new_z2 = np.hstack((z2, z1[unsorted_insert, 1:]))
Haven't timed it, but the whole thing seems to complete in under a couple of seconds.
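One caveat: np.searchsorted returns an insertion point even when the identifier is not present, so the approach silently relies on every identifier in Z2 existing in Z1, as the question states. If that is in doubt, a check along the following lines may help. This is only a sketch: match_rows is a hypothetical helper name, not part of NumPy, and it assumes the identifier is in column 0.

```python
import numpy as np

def match_rows(z1, z2):
    """Return, for each row of z2, the row index in z1 with the same
    identifier in column 0. Raises KeyError if any identifier is missing.
    (Hypothetical helper, shown for illustration only.)"""
    sort_idx = np.argsort(z1[:, 0])
    pos = np.searchsorted(z1[:, 0], z2[:, 0], sorter=sort_idx)
    # An identifier larger than every id in z1 yields pos == len(z1),
    # which would be out of bounds, so clamp before indexing.
    pos = np.clip(pos, 0, len(z1) - 1)
    rows = sort_idx[pos]
    # Verify that the row we landed on really carries the same identifier.
    found = z1[rows, 0] == z2[:, 0]
    if not found.all():
        missing = np.unique(z2[~found, 0])
        raise KeyError(f"{missing.size} identifier(s) in z2 not found in z1")
    return rows

# Usage on small made-up arrays.
z1 = np.array([[30, 1, 2], [10, 3, 4], [20, 5, 6]])
z2 = np.array([[20, 7], [30, 8]])
rows = match_rows(z1, z2)
new_z2 = np.hstack((z2, z1[rows, 1:]))
```

If identifiers can repeat within Z1, searchsorted with its default side='left' picks the first match in sorted order, so duplicates would need separate handling.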
Upvotes: 3