Eric Hansen

Reputation: 1809

Comparing two matrices row-wise by occurrence in NumPy

Suppose I have two NumPy matrices (or Pandas DataFrames, though I'm guessing this will be faster in NumPy).

>>> arr1
array([[3, 1, 4],
       [4, 3, 5],
       [6, 5, 4],
       [6, 5, 4],
       [3, 1, 4]])
>>> arr2
array([[3, 1, 4],
       [8, 5, 4],
       [3, 1, 4],
       [6, 5, 4],
       [3, 1, 4]])

For every row vector in arr1, I want to count its occurrences in arr2 and generate a vector of these counts. So for this example, the result would be

[3, 0, 1, 1, 3]

What is an efficient way to do this?

First approach: The obvious approach of looping over the row vectors of arr1 and generating a corresponding boolean mask on arr2 seems very slow.

np.apply_along_axis(lambda x: (x == arr2).all(1).sum(), axis=1, arr=arr1)

And it seems like a bad algorithm, as I have to check the same rows multiple times.
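(For reference, this first approach can also be written as a single broadcast comparison; this removes the Python-level loop but still performs all O(n·m) row comparisons, so it shares the same algorithmic weakness. A sketch, not one of the posted answers:)

```python
import numpy as np

arr1 = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
arr2 = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4]])

# Compare every row of arr1 against every row of arr2 in one shot:
# (5, 1, 3) vs (5, 3) broadcasts to (5, 5, 3); .all(-1) collapses that to a
# (5, 5) row-equality matrix; summing along axis 1 counts matches per row.
counts = (arr1[:, None] == arr2).all(-1).sum(1)
```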

Second approach: I could store the row counts in a collections.Counter, and then just access that with apply_along_axis.

cnter = Counter(tuple(row) for row in arr2)
np.apply_along_axis(lambda x: cnter[tuple(x)], axis=1, arr=arr1)

This seems to be somewhat faster, but I feel like there has to still be a more direct approach than this.
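(As an aside: much of the remaining overhead in this second approach is np.apply_along_axis itself, which calls a Python function per row. A plain comprehension over the Counter does the same lookups with less machinery; a sketch:)

```python
from collections import Counter

import numpy as np

arr1 = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
arr2 = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4]])

# Count each row of arr2 once, then look up arr1's rows directly.
cnter = Counter(map(tuple, arr2))
counts = np.array([cnter[tuple(row)] for row in arr1])
```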

Upvotes: 2

Views: 2734

Answers (3)

Nickil Maveli

Reputation: 29711

numpy:

Start off by gathering the linear-index equivalents of the rows of a2 using np.ravel_multi_index, with dimensions taken from a2.max(0) + 1 (the + 1 accounts for numpy's 0-based indexing). Get the counts for the unique rows through np.unique(). Next, find matching rows between the unique rows of a2 and a1 by extending a1 with a new axis (broadcasting) and extract the indices of the matching rows in both arrays.

Initialize an array of zeros and fill its values by slicing based on the obtained indices.

def comp2D_vect(a1, a2):
    # Collapse each row of a2 to a single linear index.
    midx = np.ravel_multi_index(a2.T, a2.max(0) + 1)
    # np.unique returns (unique values, first indices, counts), in that order.
    a, idx, cnt = np.unique(midx, return_index=True, return_counts=True)
    # Match each row of a1 against the unique rows of a2 via broadcasting.
    m1, m2 = (a1[:, None] == a2[idx]).all(-1).nonzero()
    out = np.zeros(a1.shape[0], dtype=int)
    out[m1] = cnt[m2]
    return out
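A quick sanity check on the question's sample arrays (the function is repeated here so the snippet runs standalone):

```python
import numpy as np

def comp2D_vect(a1, a2):
    # Collapse each row of a2 to a single linear index.
    midx = np.ravel_multi_index(a2.T, a2.max(0) + 1)
    a, idx, cnt = np.unique(midx, return_index=True, return_counts=True)
    # Match each row of a1 against the unique rows of a2 via broadcasting.
    m1, m2 = (a1[:, None] == a2[idx]).all(-1).nonzero()
    out = np.zeros(a1.shape[0], dtype=int)
    out[m1] = cnt[m2]
    return out

arr1 = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
arr2 = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4]])

result = comp2D_vect(arr1, arr2)  # -> array([3, 0, 1, 1, 3])
```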

benchmarks:

For: a2 = a2.repeat(100000, axis=0)

%%timeit
df = pd.DataFrame(a2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(a1.T.tolist(), fill_value=0).values
10 loops, best of 3: 67.2 ms per loop    # @ Ted Petrou's solution

%timeit comp2D_vect(a1, a2)
10 loops, best of 3: 34 ms per loop      # Posted solution

%timeit searchsorted_based(a1,a2)
10 loops, best of 3: 27.6 ms per loop    # @ Divakar's solution (winner)

Upvotes: 2

Divakar

Reputation: 221524

Here's a NumPy approach: convert the inputs to 1D equivalents, then sort and use np.searchsorted along with np.bincount for the counting -

def searchsorted_based(a, b):
    # Common dimensions so both arrays map into the same linear-index space.
    dims = np.maximum(a.max(0), b.max(0)) + 1

    # Collapse each row to a single linear index.
    a1D = np.ravel_multi_index(a.T, dims)
    b1D = np.ravel_multi_index(b.T, dims)

    # Unique rows of a and, for each row of a, its position among them.
    unq_a1D, IDs = np.unique(a1D, return_inverse=True)

    # Locate each row of b among the unique rows of a; clamp overflow and
    # mask out b rows that aren't actually present in a.
    fidx = np.searchsorted(unq_a1D, b1D)
    fidx[fidx == unq_a1D.size] = 0
    mask = unq_a1D[fidx] == b1D

    # Count hits per unique row, then map back to a's row order.
    # minlength guards the case where a's largest unique row has no match in b.
    count = np.bincount(fidx[mask], minlength=unq_a1D.size)
    out = count[IDs]
    return out

Sample run -

In [308]: a
Out[308]: 
array([[3, 1, 4],
       [4, 3, 5],
       [6, 5, 4],
       [6, 5, 4],
       [3, 1, 4]])

In [309]: b
Out[309]: 
array([[3, 1, 4],
       [8, 5, 4],
       [3, 1, 4],
       [6, 5, 4],
       [3, 1, 4],
       [2, 1, 5]])

In [310]: searchsorted_based(a,b)
Out[310]: array([3, 0, 1, 1, 3])

Runtime test -

In [377]: A = a[np.random.randint(0,a.shape[0],(1000))]

In [378]: B = b[np.random.randint(0,b.shape[0],(1000))]

In [379]: np.allclose(comp2D_vect(A,B), searchsorted_based(A,B))
Out[379]: True

# @Nickil Maveli's soln
In [380]: %timeit comp2D_vect(A,B)
10000 loops, best of 3: 184 µs per loop

In [381]: %timeit searchsorted_based(A,B)
10000 loops, best of 3: 92.6 µs per loop

Upvotes: 3

Ted Petrou

Reputation: 61947

Pandas would be a good tool for this. You can put arr2 into a DataFrame, use the groupby method to count the number of occurrences of each row, and then reindex the result with arr1.

arr1=np.array([[3, 1, 4],
       [4, 3, 5],
       [6, 5, 4],
       [6, 5, 4],
       [3, 1, 4]])

arr2 = np.array([[3, 1, 4],
       [8, 5, 4],
       [3, 1, 4],
       [6, 5, 4],
       [3, 1, 4]])

df = pd.DataFrame(arr2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(arr1.T.tolist(), fill_value=0)

Output

a  b  c
3  1  4    3
4  3  5    0
6  5  4    1
      4    1
3  1  4    3
dtype: int64
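If a plain NumPy vector is needed rather than a Series, the reindexed result can be converted at the end. A sketch that builds the lookup index from arr1's rows explicitly (equivalent to passing arr1.T.tolist() as above):

```python
import numpy as np
import pandas as pd

arr1 = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
arr2 = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4]])

# Count each distinct row of arr2.
df = pd.DataFrame(arr2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()

# Look up arr1's rows against those counts, then drop back to NumPy.
target = pd.MultiIndex.from_arrays(arr1.T)
result = df_count.reindex(target, fill_value=0).to_numpy()
```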

Timings
Create a lot more data first

arr2_2 = arr2.repeat(100000, axis=0)

Now time it:

%%timeit
cnter = Counter(tuple(row) for row in arr2_2)
np.apply_along_axis(lambda x: cnter[tuple(x)], axis=1, arr=arr1)

1 loop, best of 3: 704 ms per loop

%%timeit
df = pd.DataFrame(arr2_2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(arr1.T.tolist(), fill_value=0)

10 loops, best of 3: 53.8 ms per loop

Upvotes: 1
