Suppose I have two NumPy matrices (or Pandas DataFrames, though I'm guessing this will be faster in NumPy).
>>> arr1
array([[3, 1, 4],
       [4, 3, 5],
       [6, 5, 4],
       [6, 5, 4],
       [3, 1, 4]])
>>> arr2
array([[3, 1, 4],
       [8, 5, 4],
       [3, 1, 4],
       [6, 5, 4],
       [3, 1, 4]])
For every row vector in arr1, I want to count the occurrences of that row vector in arr2 and generate a vector of these counts. So for this example, the result would be

[3, 0, 1, 1, 3]

What is an efficient way to do this?
First approach: The obvious approach of just looping over the row vectors of arr1 and generating a corresponding boolean vector on arr2 seems very slow:

np.apply_along_axis(lambda x: (x == arr2).all(1).sum(), axis=1, arr=arr1)

And it seems like a bad algorithm, as I have to check the same rows multiple times.
Second approach: I could store the row counts in a collections.Counter, and then just access that with apply_along_axis.

from collections import Counter

cnter = Counter(tuple(row) for row in arr2)
np.apply_along_axis(lambda x: cnter[tuple(x)], axis=1, arr=arr1)

This seems to be somewhat faster, but I feel like there still has to be a more direct approach than this.
Upvotes: 2
Views: 2734
numpy:
Start off by gathering the linear index equivalents of the row and column subscripts of a2 using np.ravel_multi_index; the dimensions passed are the per-column maxima plus 1, to account for NumPy's 0-based indexing. Get the counts of the unique rows present through np.unique(). Next, find the rows of a1 that match the unique rows of a2 by extending a1 with a new trailing axis (broadcasting), and extract the indices of the matches for both arrays with nonzero(). Finally, initialize an array of zeros and fill in its values at the matched positions using the obtained indices.
import numpy as np

def comp2D_vect(a1, a2):
    # Collapse each row of a2 to a single linear index.
    midx = np.ravel_multi_index(a2.T, a2.max(0) + 1)
    # Unique linear indices, their first positions, and their counts.
    a, idx, cnt = np.unique(midx, return_counts=True, return_index=True)
    # Broadcast-compare every row of a1 against the unique rows of a2.
    m1, m2 = (a1[:, None] == a2[idx]).all(-1).nonzero()
    # Scatter the counts back to the matching positions of a1.
    out = np.zeros(a1.shape[0], dtype=int)
    out[m1] = cnt[m2]
    return out
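
For the arrays in the question, the intermediate values look like this (a small sketch for illustration):

import numpy as np

arr1 = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
arr2 = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4]])

# Each row of arr2 collapses to one integer on a grid of dims
# arr2.max(0) + 1 = (9, 6, 5), i.e. i*30 + j*5 + k.
midx = np.ravel_multi_index(arr2.T, arr2.max(0) + 1)
print(midx)                      # [ 99 269  99 209  99]

# Unique rows of arr2, their first positions, and their counts.
_, idx, cnt = np.unique(midx, return_index=True, return_counts=True)
print(arr2[idx])                 # rows [3 1 4], [6 5 4], [8 5 4]
print(cnt)                       # [3 1 1]

print(comp2D_vect(arr1, arr2))   # [3 0 1 1 3]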
benchmarks:
For: a2 = a2.repeat(100000, axis=0)
%%timeit
df = pd.DataFrame(a2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(a1.T.tolist(), fill_value=0).values
10 loops, best of 3: 67.2 ms per loop # @ Ted Petrou's solution
%timeit comp2D_vect(a1, a2)
10 loops, best of 3: 34 ms per loop # Posted solution
%timeit searchsorted_based(a1,a2)
10 loops, best of 3: 27.6 ms per loop # @ Divakar's solution (winner)
Upvotes: 2
Here's a NumPy approach: convert the inputs to 1D equivalents, then sort and use np.searchsorted along with np.bincount for the counting -
import numpy as np

def searchsorted_based(a, b):
    # Collapse each row to a single linear index on a grid that
    # covers both inputs.
    dims = np.maximum(a.max(0), b.max(0)) + 1
    a1D = np.ravel_multi_index(a.T, dims)
    b1D = np.ravel_multi_index(b.T, dims)

    # Unique (sorted) rows of a; IDs maps each row of a to its unique row.
    unq_a1D, IDs = np.unique(a1D, return_inverse=True)

    # For each row of b, find its insertion position among the unique
    # rows of a, then keep only exact matches.
    fidx = np.searchsorted(unq_a1D, b1D)
    fidx[fidx == unq_a1D.size] = 0          # clamp out-of-range positions
    mask = unq_a1D[fidx] == b1D

    # Count matches per unique row of a; minlength keeps one slot per
    # unique row even when it never occurs in b.
    count = np.bincount(fidx[mask], minlength=unq_a1D.size)
    return count[IDs]
Sample run -
In [308]: a
Out[308]:
array([[3, 1, 4],
       [4, 3, 5],
       [6, 5, 4],
       [6, 5, 4],
       [3, 1, 4]])

In [309]: b
Out[309]:
array([[3, 1, 4],
       [8, 5, 4],
       [3, 1, 4],
       [6, 5, 4],
       [3, 1, 4],
       [2, 1, 5]])

In [310]: searchsorted_based(a,b)
Out[310]: array([3, 0, 1, 1, 3])
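
The intermediate arrays for this sample look like this (a sketch for illustration):

dims = np.maximum(a.max(0), b.max(0)) + 1   # [9 6 6]
a1D = np.ravel_multi_index(a.T, dims)       # [118 167 250 250 118]
b1D = np.ravel_multi_index(b.T, dims)       # [118 322 118 250 118  83]

unq_a1D, IDs = np.unique(a1D, return_inverse=True)
# unq_a1D = [118 167 250], IDs = [0 1 2 2 0]

fidx = np.searchsorted(unq_a1D, b1D)        # [0 3 0 2 0 0]
fidx[fidx == unq_a1D.size] = 0              # [0 0 0 2 0 0]  (322 had no match)
mask = unq_a1D[fidx] == b1D                 # [True False True True True False]

np.bincount(fidx[mask], minlength=unq_a1D.size)   # [3 0 1]
# ...and indexing with IDs = [0 1 2 2 0] yields [3 0 1 1 3].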
Runtime test -
In [377]: A = a[np.random.randint(0,a.shape[0],(1000))]
In [378]: B = b[np.random.randint(0,b.shape[0],(1000))]
In [379]: np.allclose(comp2D_vect(A,B), searchsorted_based(A,B))
Out[379]: True
# @Nickil Maveli's soln
In [380]: %timeit comp2D_vect(A,B)
10000 loops, best of 3: 184 µs per loop
In [381]: %timeit searchsorted_based(A,B)
10000 loops, best of 3: 92.6 µs per loop
Upvotes: 3
Pandas would be a good tool for this. You can put arr2 into a DataFrame, use the groupby method to count the number of occurrences of each row, and then reindex the result with arr1.
arr1 = np.array([[3, 1, 4],
                 [4, 3, 5],
                 [6, 5, 4],
                 [6, 5, 4],
                 [3, 1, 4]])
arr2 = np.array([[3, 1, 4],
                 [8, 5, 4],
                 [3, 1, 4],
                 [6, 5, 4],
                 [3, 1, 4]])
df = pd.DataFrame(arr2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(arr1.T.tolist(), fill_value=0)
Output

a  b  c
3  1  4    3
4  3  5    0
6  5  4    1
      4    1
3  1  4    3
dtype: int64
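
The reindexed result is a pandas Series with a MultiIndex; to get a plain NumPy vector like the [3, 0, 1, 1, 3] in the question, append .values (as the benchmark in the other answer does):

df_count.reindex(arr1.T.tolist(), fill_value=0).values
# array([3, 0, 1, 1, 3])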
Timings
Create a lot more data first
arr2_2 = arr2.repeat(100000, axis=0)
Now time it:
%%timeit
cnter = Counter(tuple(row) for row in arr2_2)
np.apply_along_axis(lambda x: cnter[tuple(x)], axis=1, arr=arr1)
1 loop, best of 3: 704 ms per loop
%%timeit
df = pd.DataFrame(arr2_2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(arr1.T.tolist(), fill_value=0)
10 loops, best of 3: 53.8 ms per loop
Upvotes: 1