Reputation: 548
Consider the following:
tmp1 = ['a', 'b', 'c', 'd', 'e']
tmp2 = ['f', 'g', 'h', 'b', 'd']
tmp3 = ['b', 'i', 'j', 'k', 'l']
matr = np.array([tmp1, tmp2, tmp3])
matr
Yields a matrix:
array([['a', 'b', 'c', 'd', 'e'],
['f', 'g', 'h', 'b', 'd'],
['b', 'i', 'j', 'k', 'l']],
dtype='|S1')
Now, I want to know the sum of values in each row that intersects a vector. Say,
vec = ['a', 'c', 'f', 'b']
[sum([y in vec for y in row]) for row in matr]
Returns,
[3, 2, 1]
This is the desired output. The problem with it is that my 'matr' is actually ≈ 1000000 x 2200, and I have 6700 vectors to compare against. The solution I have here is far too slow to attempt.
How can I improve what I'm doing?
It's worth noting that the values inside of the matr come from a set of ~30000 values, and I have the full set. I've considered solutions where I make a dict of these 30000 values against each vector, and use the dict to convert to True/False throughout the matrix before just summing by row. I'm not sure if this will help.
Upvotes: 3
Views: 275
Reputation: 5463
You could use set intersection to speed things up a bit. Here's a comparison:
Your present solution with list comprehensions:
%%timeit
print([sum([y in vec for y in row]) for row in matr])
#Output
[3,2,1]
20 µs ± 1.9 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Proposed solution with set intersection in list comprehension:
%%timeit
print([len(set(row).intersection(vec)) for row in matr])
#Output:
[3,2,1]
17.8 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And if the vec
is also a set, we get even better efficiency:
%%timeit
vec = {'a', 'c', 'f', 'b'}
print([len(set(row).intersection(vec)) for row in matr])
#Output:
[3, 2, 1]
16.6 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Upvotes: 2
Reputation: 11725
Let's look at the speed of your current algorithm. According to the python wiki, checking if an item is in an array like y in vec
is O(n), meaning worst case, it has to go through every element in vec
. Since you're doing that check for every element of your matrix, your total number of operations is numRows * numCols * vecLen
, which is O(n^3)
.
A faster way to do it is to construct a dictionary for vec
to optimize lookups because dictionaries are O(1)
instead of O(n)
, meaning they can do your check in 1 operation, no matter how long vec is:
vecDict = dict([(x, 1) for x in vec])
So, your new time complexity is (numRows * numCols) + vecLen
, which is O(n^2)
, which I think is as fast as you can get for your data.
[sum([y in vecDict for y in row]) for row in matr]
Upvotes: 1
Reputation: 221504
For matr
and vec
as arrays, here's one with np.searchsorted
-
def count_in_rowwise(matr,vec):
sidx = vec.argsort()
idx = np.searchsorted(vec,matr,sorter=sidx)
idx[idx==len(vec)] = 0
return (vec[sidx[idx]] == matr).sum(1)
With a comparatively smaller vec
, we can pre-sort it and use, to give us an alternative one to compute the row-counts, like so -
def count_in_rowwise_v2(matr,vec,assume_sorted=False):
if assume_sorted==1:
sorted_vec = vec
else:
sorted_vec = np.sort(vec)
idx = np.searchsorted(sorted_vec,matr)
idx[idx==len(sorted_vec)] = 0
return (sorted_vec[idx] == matr).sum(1)
The above solution(s) works on generic inputs(numbers or strings alike). To solve our specific case of strings, we could optimize it further by converting the strings to numbers by using np.unique
and then re-using count_in_rowwise/count_in_rowwise_v2
and that will give us our second approach, like so -
u,ids = np.unique(matr, return_inverse=True)
out = count_in_rowwise(ids.reshape(matr.shape),ids[np.searchsorted(u,vec)])
Upvotes: 2
Reputation: 22992
Here's a simple readable solution with np.isin()
(docs):
np.sum(np.isin(matr, vec), axis=1)
As a bonus, you can just use np.isin()
without the summing if you want to get which elements of the matrix are in the vectors:
>>> np.isin(matr, vec)
array([[ True, True, True, False, False],
[ True, False, False, True, False],
[ True, False, False, False, False]])
which shows why summing along the rows produces the desired output.
Upvotes: 1