Reputation: 1799
Let's say you have a sorted array of numbers sorted_array
and a threshold threshold
. What is the fastest way to find the longest sub-array in which all the values are within the threshold? In other words, find indices i
and j
such that:
sorted_array[j] - sorted_array[i] <= threshold
j - i
is maximalIn case of a tie, return the pair with the smallest i
.
I already have a loop-based solution, which I will post as an answer, but I'm curious to see if there's a better way, or a way that can avoid the loop using a vector-capable language or library like NumPy, for example.
Example input and output:
>>> sorted_array = [0, 0.7, 1, 2, 2.5]
>>> longest_subarray_within_threshold(sorted_array, 0.2)
(0, 0)
>>> longest_subarray_within_threshold(sorted_array, 0.4)
(1, 2)
>>> longest_subarray_within_threshold(sorted_array, 0.8)
(0, 1)
>>> longest_subarray_within_threshold(sorted_array, 1)
(0, 2)
>>> longest_subarray_within_threshold(sorted_array, 1.9)
(1, 4)
>>> longest_subarray_within_threshold(sorted_array, 2)
(0, 3)
>>> longest_subarray_within_threshold(sorted_array, 2.6)
(0, 4)
Upvotes: 1
Views: 484
Reputation: 2196
Most likely, the OP's own answer is the best possible algorithm, as it is O(n). However, the pure-python overhead makes it very slow. However, this overhead can easily be reduced by compiling the algorithm using numba, with the current version (0.28.1 as of this writing), there is no need for any manual typing, simply decorating your function with @numba.njit()
is enough.
However, if you do not want to depend on numba, there is a numpy algorithm in O(n log n):
def algo_burnpanck(sorted_array,thresh):
limit = np.searchsorted(sorted_array,sorted_array+thresh,'right')
distance = limit - np.arange(limit.size)
best = np.argmax(distance)
return best, limit[best]-1
I did run a quick profiling on my own machine of the two previous answers (OP's and Divakar's), as well as my numpy algorithm and the numba version of the OP's algorithm.
thresh = 1
for n in [100, 10000]:
sorted_array = np.sort(np.random.randn(n,))
for f in [algo_user1475412,algo_Divakar,algo_burnpanck,algo_user1475412_numba]:
a,b = f(sorted_array, thresh)
d = b-a
diff = sorted_array[b]-sorted_array[a]
closestlonger = np.min(sorted_array[d+1:]-sorted_array[:-d-1])
assert sorted_array[b]-sorted_array[a]<=thresh
assert closestlonger>thresh
print('f=%s, n=%d thresh=%s:'%(f.__name__,n,thresh))#,d,a,b,diff,closestlonger)
%timeit f(sorted_array, thresh)
Here are the results:
f=algo_user1475412, n=100 thresh=1:
10000 loops, best of 3: 111 µs per loop
f=algo_Divakar, n=100 thresh=1:
10000 loops, best of 3: 74.6 µs per loop
f=algo_burnpanck, n=100 thresh=1:
100000 loops, best of 3: 9.38 µs per loop
f=algo_user1475412_numba, n=100 thresh=1:
1000000 loops, best of 3: 764 ns per loop
f=algo_user1475412, n=10000 thresh=1:
100 loops, best of 3: 12.1 ms per loop
f=algo_Divakar, n=10000 thresh=1:
1 loop, best of 3: 1.76 s per loop
f=algo_burnpanck, n=10000 thresh=1:
1000 loops, best of 3: 308 µs per loop
f=algo_user1475412_numba, n=10000 thresh=1:
10000 loops, best of 3: 82.9 µs per loop
At 100 numbers, O(n^2) solution using numpy just barely beats the O(n) python solution, but quickly after, the scaling makes that algorithm useless. The O(n log n) keeps up even at 10000 numbers, but the numba approach is unbeaten everywhere.
Upvotes: 2
Reputation: 221514
Here's a vectorized approach using broadcasting
-
def longest_thresh_subarray(sorted_array,thresh):
diffs = (sorted_array[:,None] - sorted_array)
r = np.arange(sorted_array.size)
valid_mask = r[:,None] > r
mask = (diffs <= thresh) & valid_mask
bestcolID = (mask).sum(0).argmax()
idx = np.nonzero(mask[:,bestcolID])[0]
if len(idx)==0:
out = (0,0)
else:
out = idx[0]-1, idx[-1]
return out
Sample runs -
In [137]: sorted_array = np.array([0, 0.7, 1, 2, 2.5])
In [138]: longest_thresh_subarray(sorted_array,0.2)
Out[138]: (0, 0)
In [139]: longest_thresh_subarray(sorted_array,0.4)
Out[139]: (1, 2)
In [140]: longest_thresh_subarray(sorted_array,0.8)
Out[140]: (0, 1)
In [141]: longest_thresh_subarray(sorted_array,1)
Out[141]: (0, 2)
In [142]: longest_thresh_subarray(sorted_array,1.9)
Out[142]: (1, 4)
In [143]: longest_thresh_subarray(sorted_array,2)
Out[143]: (0, 3)
In [144]: longest_thresh_subarray(sorted_array,2.6)
Out[144]: (0, 4)
Upvotes: 0
Reputation: 1799
Here's a simple loop-based solution in Python:
def longest_subarray_within_threshold(sorted_array, threshold):
result = (0, 0)
longest = 0
i = j = 0
end = len(sorted_array)
while i < end:
if j < end and sorted_array[j] - sorted_array[i] <= threshold:
current_distance = j - i
if current_distance > longest:
longest = current_distance
result = (i, j)
j += 1
else:
i += 1
return result
Upvotes: 2