Reputation: 85
I have a numpy array such as this
[[ 0, 57],
 [ 7, 72],
 [ 2, 51],
 [ 8, 67],
 [ 4, 42]]
I want to find, for each row, how many elements in the 2nd column are within a certain distance (say, 10) of that row's 2nd-column value. So in this example, the solution would be
[[ 0, 57, 3],
 [ 7, 72, 2],
 [ 2, 51, 3],
 [ 8, 67, 3],
 [ 4, 42, 2]]
So [first row, third column] is 3 because there are 3 elements in the 2nd column (57, 51, 67) that are within distance 10 of 57. Similarly for each row.
Any help would be appreciated!
Upvotes: 2
Views: 1183
Reputation: 353459
Here's a non-broadcasting approach, which takes advantage of the fact that to know how many numbers are within 3 of 10, you can subtract the count of numbers strictly less than 7 from the count of numbers <= 13.
import numpy as np

def broadcast(x, width):
    # broadcasting version, for comparison
    return (np.abs(x[:,None] - x) <= width).sum(1)

def largest_leq(arr, x, allow_equal=True):
    # index of the largest element of sorted arr that is <= x
    # (strictly < x when allow_equal=False); -1 if no such element
    maybe = np.searchsorted(arr, x)
    maybe = maybe.clip(0, len(arr) - 1)
    above = arr[maybe] > x if allow_equal else arr[maybe] >= x
    maybe[above] -= 1
    return maybe

def faster(x, width):
    uniq, inv, counts = np.unique(x, return_counts=True, return_inverse=True)
    counts = counts.cumsum()          # counts[i] = how many values are <= uniq[i]
    low_bounds = uniq - width
    low_ix = largest_leq(uniq, low_bounds, allow_equal=False)
    low_counts = counts[low_ix]
    low_counts[low_ix < 0] = 0        # nothing is below the lower bound
    high_bounds = uniq + width
    high_counts = counts[largest_leq(uniq, high_bounds)]
    delta = high_counts - low_counts  # values within [uniq-width, uniq+width]
    out = delta[inv]                  # map back to the original positions
    return out
This passes my tests:
for width in range(1, 10):
    for window in range(5):
        for trial in range(10):
            x = np.random.randint(0, 10, width)
            b = broadcast(x, window).tolist()
            f = faster(x, window).tolist()
            assert b == f
and behaves pretty well even at larger sizes:
In [171]: x = np.random.random(10**6)
In [172]: %time faster(x, 0)
Wall time: 386 ms
Out[172]: array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
In [173]: %time faster(x, 1)
Wall time: 372 ms
Out[173]: array([1000000, 1000000, 1000000, ..., 1000000, 1000000, 1000000], dtype=int64)
In [174]: x = np.random.randint(0, 10, 10**6)
In [175]: %timeit faster(x, 3)
10 loops, best of 3: 83 ms per loop
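For reference, the same counting trick can be sketched in its most condensed form with two `np.searchsorted` calls on a sorted copy; this is a simplified variant, not the exact code above:

```python
import numpy as np

# Condensed sketch of the counting trick: on sorted data, the number of
# values v with |v - x| <= w is (count of values <= x + w) minus
# (count of values strictly < x - w).
x = np.array([57, 72, 51, 67, 42])   # the question's 2nd column
w = 10
s = np.sort(x)
hi = np.searchsorted(s, x + w, side='right')  # values <= x + w
lo = np.searchsorted(s, x - w, side='left')   # values <  x - w
print(hi - lo)  # [3 2 3 3 2]
```

The `side` arguments do the work of `largest_leq` here: `'right'` counts values `<=` the bound, `'left'` counts values strictly `<` it.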
Upvotes: 1
Reputation: 221674
Here's one approach leveraging broadcasting with outer-subtraction -

(np.abs(a[:,1,None] - a[:,1]) <= 10).sum(1)

With the outer subtract builtin np.subtract.outer and count_nonzero for counting -

np.count_nonzero(np.abs(np.subtract.outer(a[:,1], a[:,1])) <= 10, axis=1)
Sample run -
# Input array
In [23]: a
Out[23]:
array([[ 0, 57],
       [ 7, 72],
       [ 2, 51],
       [ 8, 67],
       [ 4, 42]])
# Get count
In [24]: count = (np.abs(a[:,1,None] - a[:,1]) <= 10).sum(1)
In [25]: count
Out[25]: array([3, 2, 3, 3, 2])
# Stack with input
In [26]: np.c_[a,count]
Out[26]:
array([[ 0, 57,  3],
       [ 7, 72,  2],
       [ 2, 51,  3],
       [ 8, 67,  3],
       [ 4, 42,  2]])
Alternatively with SciPy's cdist -
In [53]: from scipy.spatial.distance import cdist
In [54]: (cdist(a[:,None,1],a[:,1,None], 'minkowski', p=2)<=10).sum(1)
Out[54]: array([3, 2, 3, 3, 2])
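Since the points here are 1-D, the `'cityblock'` metric (absolute difference) gives the same distances as the `'minkowski', p=2` call above; a sketch:

```python
import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[0, 57], [7, 72], [2, 51], [8, 67], [4, 42]])
col = a[:, 1, None].astype(float)   # 2nd column as (n, 1) points
# For 1-D points, cityblock distance is just |x - y|, so this matches
# the minkowski call above.
count = (cdist(col, col, 'cityblock') <= 10).sum(1)
print(count)  # [3 2 3 3 2]
```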
For a million rows in the input, we might want to resort to a loopy one to keep the memory footprint down -

n = len(a)
count = np.empty(n, dtype=int)
for i in range(n):
    count[i] = np.count_nonzero(np.abs(a[:,1] - a[i,1]) <= 10)
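The loopy version, assembled into a self-contained function (the name `count_within` is illustrative, not from the answer) and stacked with the input as in the question:

```python
import numpy as np

# count_within is a hypothetical helper name; it wraps the loop above
# and appends the count as a 3rd column.
def count_within(a, dist=10):
    n = len(a)
    count = np.empty(n, dtype=int)
    for i in range(n):
        # compare row i's 2nd-column value against the whole 2nd column
        count[i] = np.count_nonzero(np.abs(a[:, 1] - a[i, 1]) <= dist)
    return np.c_[a, count]

a = np.array([[0, 57], [7, 72], [2, 51], [8, 67], [4, 42]])
print(count_within(a))
```

This trades speed for O(n) memory per iteration instead of the O(n^2) intermediate that broadcasting builds.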
Upvotes: 2