Reputation: 459
I am trying to compute the number of occurrences of pairs of values. When running the following code the numpy version (pairs_frequency2) is more than 50% slower than the version relying on collections.Counter (it gets worse when the number of points increases). Could someone please explain the reason why.
Is there a possible numpy rewrite to achieve better performance ?
Thanks in advance.
import numpy as np
from collections import Counter
def pairs_frequency(x, y):
counts = Counter(zip(x, y))
res = np.array([[f, a, b] for ((a, b), f) in counts.items()])
return res[:, 0], res[:, 1], res[:, 2]
def pairs_frequency2(x, y):
unique, counts = np.unique(np.column_stack((x,y)), axis=0, return_counts=True)
return counts, unique[:,0], unique[:,1]
x = np.random.randint(low=1, high=11, size=50000)
y = x + np.random.randint(1, 5, size=x.size)
%timeit pairs_frequency(x, y)
%timeit pairs_frequency2(x, y)
Upvotes: 0
Views: 1458
Reputation: 114831
numpy.unique
sorts its argument, so its time complexity is O(n*log(n)). It looks like the Counter
class could be O(n).
If the values in your arrays are nonnegative integers that are not too big, this version is pretty fast:
def pairs_frequency3(x, y, maxval=15):
z = maxval*x + y
counts = np.bincount(z)
pos = counts.nonzero()[0]
ux, uy = np.divmod(pos, maxval)
return counts[pos], ux, uy
Set maxval
to the 1 plus the maximum value in x
and y
. (You could remove the argument, and add code to find the maximum in the function.)
Timing (x
and y
were generated as in the question):
In [13]: %timeit pairs_frequency(x, y)
13.8 ms ± 77.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [14]: %timeit pairs_frequency2(x, y)
32.9 ms ± 631 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit pairs_frequency3(x, y)
129 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Note the change in time units of the third result.
pairs_frequency3
returns the arrays in the same order as pairs_frequency2
, so it is easy to verify that they return the same values:
In [26]: counts2, x2, y2 = pairs_frequency2(x, y)
In [27]: counts3, x3, y3 = pairs_frequency3(x, y)
In [28]: np.all(counts2 == counts3) and np.all(x2 == x3) and np.all(y2 == y3)
Out[28]: True
Upvotes: 2