Reputation: 1840
Question description
Lets say we have two simple arrays:
query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])
I want to find the indexes of all matching values between the query and match. So in this case the result would be:
value query match
100 0 1
100 0 3
100 5 1
100 5 3
4000 1 2
In reality these arrays will contain millions of items
"Stupid" loop solution
qs = []
query_locs = []
match_locs = []
for i in np.arange(query.size):
q = query[i]
# Get matching indexes in "match"
match_loc = np.where(match == q)[0]
n = match_loc.size
# Update location arrays
match_locs.extend(match_loc)
query_locs.extend(np.repeat(i,n))
# Store the matching value
qs.extend(np.repeat(q,n))
result = np.vstack((qs, query_locs, match_locs)).T
print(result)
[[ 100 0 1]
[ 100 0 3]
[4000 1 2]
[ 100 5 1]
[ 100 5 3]]
(Maybe numba
could make this loop pretty fast however when I tried this I got some errors about the signatures, so not sure about that)
Numpy buildins
There are quite some buildin numpy function to solve this problem for unique values, like using searchsorted
, intersect1d
, however, as also described in the doc, they "Return the sorted, unique values" and thus do not take duplicates into account. Some examples on StackOverflow for this problem with unique values:
I could imagine there would be a faster way to do this with numpy instead of a loop, so curious to see an answer!
Upvotes: 4
Views: 660
Reputation: 471
You may transform 1d-arrays to dataframes and make a join, like this:
query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])
dfquery = pd.DataFrame(range(len(query)), index=query, columns=['query'])
dfmatch = pd.DataFrame(range(len(match)), index=match, columns=['match'])
dfquery.join(dfmatch, how='inner')
Result:
query match
100 0 1
100 0 3
100 5 1
100 5 3
4000 1 2
Upvotes: 3
Reputation: 3453
You could hack around it with newaxis:
>>> comparison = np.equal(query[:, np.newaxis], match[np.newaxis, :])
array([[False, True, False, True, False, False, False],
[False, False, True, False, False, False, False],
[False, False, False, False, False, False, False],
[False, False, False, False, False, False, False],
[False, False, False, False, False, False, False],
[False, True, False, True, False, False, False]])
which essentially creates the cartesian product (query
x matches
) (note the memory cost) and then applies the binary function np.equal
to cast each element in the product space to a bool
efficiently.
The output can be interpreted by reading it row-wise as: query element i is equal to match element j whenever comparison[i, j]
is True
.
You can gather the indices of all True
pairs with:
list(zip(*comparison.nonzero()))
[(0, 1), (0, 3), (1, 2), (5, 1), (5, 3)]
ps: If the arrays are too long to construct the product, iterating over them element-wise is your only option.
Upvotes: 0