Reputation: 33
I'm pretty new to numpy and I'm trying to vectorize a simple for loop for performance reasons, but I can't come up with a solution. I have a numpy array of unique words, and for each of these words I need the number of times it occurs in another numpy array, called array_to_compare. The count is stored in a third numpy array, which has the same shape as the unique words array. Here is the code containing the for loop:
import numpy as np
unique_words = np.array(['a', 'b', 'c', 'd'])
array_to_compare = np.array(['a', 'b', 'a', 'd'])
vector_array = np.zeros(len(unique_words))
for word in np.nditer(unique_words):
    counter = np.count_nonzero(array_to_compare == word)
    vector_array[np.where(unique_words == word)] = counter
# vector_array is now [2. 1. 0. 1.], the desired output
I tried it with np.where and np.isin, but did not get the desired result. I am thankful for any help!
Upvotes: 2
Views: 649
Reputation: 3857
I'd probably use a Counter and a list comprehension to solve this:
In [1]: import numpy as np
...:
...: unique_words = np.array(['a', 'b', 'c', 'd'])
...: array_to_compare = np.array(['a', 'b', 'a', 'd'])
In [2]: from collections import Counter
In [3]: counter = Counter(array_to_compare)
In [4]: counter
Out[4]: Counter({'a': 2, 'b': 1, 'd': 1})
In [5]: vector_array = np.array([counter[key] for key in unique_words])
In [6]: vector_array
Out[6]: array([2, 1, 0, 1])
Assembling the Counter takes linear time in the length of array_to_compare, and iterating through your unique_words is also linear.
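The same steps can be wrapped in a small function, as a minimal sketch. The name count_words is hypothetical and chosen only for illustration; converting the arrays with tolist() first is an optional choice so that the Counter keys are plain Python strings.

```python
from collections import Counter

import numpy as np


def count_words(unique_words, array_to_compare):
    """Count how often each entry of unique_words occurs in array_to_compare."""
    # One linear pass over array_to_compare builds the tally.
    counter = Counter(array_to_compare.tolist())
    # A Counter returns 0 for keys it has never seen, so words that are
    # missing from array_to_compare need no special handling.
    return np.array([counter[word] for word in unique_words.tolist()], dtype=int)


unique_words = np.array(['a', 'b', 'c', 'd'])
array_to_compare = np.array(['a', 'b', 'a', 'd'])
print(count_words(unique_words, array_to_compare))  # [2 1 0 1]
```

The total work is one pass over each array, so it grows with len(array_to_compare) + len(unique_words).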
Upvotes: 2
Reputation: 231335
A numpy comparison of array values using broadcasting:
In [76]: unique_words[:,None]==array_to_compare
Out[76]:
array([[ True, False,  True, False],
       [False,  True, False, False],
       [False, False, False, False],
       [False, False, False,  True]])
In [77]: (unique_words[:,None]==array_to_compare).sum(1)
Out[77]: array([2, 1, 0, 1])
In [78]: timeit (unique_words[:,None]==array_to_compare).sum(1)
9.5 µs ± 2.79 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
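To make the broadcasting step easier to follow, here is the same expression split into named pieces. The variable names column, matches and counts are only illustrative.

```python
import numpy as np

unique_words = np.array(['a', 'b', 'c', 'd'])
array_to_compare = np.array(['a', 'b', 'a', 'd'])

# Adding an axis turns the 1-D array into a column of shape (4, 1).
column = unique_words[:, None]

# Comparing (4, 1) against (4,) broadcasts to a (4, 4) boolean matrix:
# row i marks the positions where array_to_compare equals unique_words[i].
matches = column == array_to_compare

# Summing each row counts the True values, one count per unique word.
counts = matches.sum(axis=1)

print(column.shape, matches.shape)  # (4, 1) (4, 4)
print(counts)                       # [2 1 0 1]
```

Note that the intermediate boolean matrix has len(unique_words) * len(array_to_compare) elements, so memory use grows with the product of the two lengths.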
But Counter is also a good choice:
In [72]: %%timeit
...: c=Counter(array_to_compare)
...: [c[key] for key in unique_words]
12.7 µs ± 30.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Your use of count_nonzero can be improved with:
In [73]: %%timeit
...: words=unique_words.tolist()
...: vector_array = np.zeros(len(words))
...: for i,word in enumerate(words):
...:     counter = np.count_nonzero(array_to_compare == word)
...:     vector_array[i] = counter
...:
23.4 µs ± 505 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Iteration on lists is faster than on arrays (nditer doesn't add much). And enumerate lets us skip the where test.
Upvotes: 1
Reputation: 5560
Similar to @DanielLenz's answer, but using np.unique to create a dict:
import numpy as np
unique_words = np.array(['a', 'b', 'c', 'd'])
array_to_compare = np.array(['a', 'b', 'a', 'd'])
counts = dict(zip(*np.unique(array_to_compare, return_counts=True)))
result = np.array([counts.get(word, 0) for word in unique_words])
print(result)  # [2 1 0 1]
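If the Python-level loop over unique_words should go away as well, one possible variant keeps np.unique but replaces the dict lookup with np.searchsorted. This is a sketch of a different technique than the one above, and the function name count_occurrences is hypothetical. It relies on np.unique returning its values in sorted order.

```python
import numpy as np


def count_occurrences(unique_words, array_to_compare):
    """Vectorized lookup of per-word counts via np.unique and np.searchsorted."""
    # found is sorted; counts[i] is how often found[i] occurs.
    found, counts = np.unique(array_to_compare, return_counts=True)
    if found.size == 0:
        # Nothing to count against: every word occurs zero times.
        return np.zeros(len(unique_words), dtype=int)
    # Position where each word would be inserted into the sorted array.
    idx = np.searchsorted(found, unique_words)
    # Words larger than every entry in found get index len(found); clip it.
    idx = np.clip(idx, 0, found.size - 1)
    # A word is present only if the entry at its position is an exact match.
    hit = found[idx] == unique_words
    return np.where(hit, counts[idx], 0)


unique_words = np.array(['a', 'b', 'c', 'd'])
array_to_compare = np.array(['a', 'b', 'a', 'd'])
print(count_occurrences(unique_words, array_to_compare))  # [2 1 0 1]
```

Unlike the broadcasting approach, this does not build a len(unique_words) by len(array_to_compare) intermediate array, at the cost of the sort that np.unique performs.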
Upvotes: 1