John Rouhana
John Rouhana

Reputation: 548

Finding Intersection Between a Matrix and a Vector, by Row

Consider the following:

tmp1 = ['a', 'b', 'c', 'd', 'e']
tmp2 = ['f', 'g', 'h', 'b', 'd']
tmp3 = ['b', 'i', 'j', 'k', 'l']
matr = np.array([tmp1, tmp2, tmp3])

matr

Yields a matrix:

array([['a', 'b', 'c', 'd', 'e'],
   ['f', 'g', 'h', 'b', 'd'],
   ['b', 'i', 'j', 'k', 'l']], 
  dtype='|S1')

Now, I want to know the sum of values in each row that intersects a vector. Say,

vec = ['a', 'c', 'f', 'b']
[sum([y in vec for y in row]) for row in matr]

Returns,

[3, 2, 1]

This is the desired output. The problem with it is that my 'matr' is actually ≈ 1000000 x 2200, and I have 6700 vectors to compare against. The solution I have here is far too slow to attempt.

How can I improve what I'm doing?

It's worth noting that the values inside of the matr come from a set of ~30000 values, and I have the full set. I've considered solutions where I make a dict of these 30000 values against each vector, and use the dict to convert to True/False throughout the matrix before just summing by row. I'm not sure if this will help.

Upvotes: 3

Views: 275

Answers (4)

amanb
amanb

Reputation: 5463

You could use set intersection to speed things up a bit. Here's a comparison:

Your present solution with list comprehensions:

%%timeit
print([sum([y in vec for y in row]) for row in matr])
#Output
[3,2,1]
20 µs ± 1.9 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Proposed solution with set intersection in list comprehension:

%%timeit
print([len(set(row).intersection(vec)) for row in matr])
#Output:
[3,2,1]
17.8 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And if the vec is also a set, we get even better efficiency:

%%timeit
vec = {'a', 'c', 'f', 'b'}
print([len(set(row).intersection(vec)) for row in matr])
#Output:
[3, 2, 1]
16.6 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Upvotes: 2

frodo2975
frodo2975

Reputation: 11725

Let's look at the speed of your current algorithm. According to the python wiki, checking if an item is in an array like y in vec is O(n), meaning worst case, it has to go through every element in vec. Since you're doing that check for every element of your matrix, your total number of operations is numRows * numCols * vecLen, which is O(n^3).

A faster way to do it is to construct a dictionary for vec to optimize lookups because dictionaries are O(1) instead of O(n), meaning they can do your check in 1 operation, no matter how long vec is:

vecDict = dict([(x, 1) for x in vec])

So, your new time complexity is (numRows * numCols) + vecLen, which is O(n^2), which I think is as fast as you can get for your data.

[sum([y in vecDict for y in row]) for row in matr]

Upvotes: 1

Divakar
Divakar

Reputation: 221504

For matr and vec as arrays, here's one with np.searchsorted -

def count_in_rowwise(matr,vec):
    sidx = vec.argsort()
    idx = np.searchsorted(vec,matr,sorter=sidx)
    idx[idx==len(vec)] = 0
    return (vec[sidx[idx]] == matr).sum(1)

With a comparatively smaller vec, we can pre-sort it and use, to give us an alternative one to compute the row-counts, like so -

def count_in_rowwise_v2(matr,vec,assume_sorted=False):
    if assume_sorted==1:
        sorted_vec = vec
    else:
        sorted_vec = np.sort(vec)
    idx = np.searchsorted(sorted_vec,matr)
    idx[idx==len(sorted_vec)] = 0
    return (sorted_vec[idx] == matr).sum(1)

The above solution(s) works on generic inputs(numbers or strings alike). To solve our specific case of strings, we could optimize it further by converting the strings to numbers by using np.unique and then re-using count_in_rowwise/count_in_rowwise_v2 and that will give us our second approach, like so -

u,ids = np.unique(matr, return_inverse=True)
out = count_in_rowwise(ids.reshape(matr.shape),ids[np.searchsorted(u,vec)])

Upvotes: 2

alkasm
alkasm

Reputation: 22992

Here's a simple readable solution with np.isin() (docs):

np.sum(np.isin(matr, vec), axis=1)

As a bonus, you can just use np.isin() without the summing if you want to get which elements of the matrix are in the vectors:

>>> np.isin(matr, vec)
array([[ True,  True,  True, False, False],
       [ True, False, False,  True, False],
       [ True, False, False, False, False]])

which shows why summing along the rows produces the desired output.

Upvotes: 1

Related Questions