user1475412

Reputation: 1799

How to find the longest sub-array within a threshold?

Let's say you have a sorted array of numbers sorted_array and a threshold threshold. What is the fastest way to find the longest sub-array in which all the values are within the threshold? In other words, find indices i and j such that:

  1. sorted_array[j] - sorted_array[i] <= threshold
  2. j - i is maximal

In case of a tie, return the pair with the smallest i.

I already have a loop-based solution, which I will post as an answer, but I'm curious to see if there's a better way, or a way that can avoid the loop using a vector-capable language or library like NumPy, for example.

Example input and output:

>>> sorted_array = [0, 0.7, 1, 2, 2.5]
>>> longest_subarray_within_threshold(sorted_array, 0.2)
(0, 0)
>>> longest_subarray_within_threshold(sorted_array, 0.4)
(1, 2)
>>> longest_subarray_within_threshold(sorted_array, 0.8)
(0, 1)
>>> longest_subarray_within_threshold(sorted_array, 1)
(0, 2)
>>> longest_subarray_within_threshold(sorted_array, 1.9)
(1, 4)
>>> longest_subarray_within_threshold(sorted_array, 2)
(0, 3)
>>> longest_subarray_within_threshold(sorted_array, 2.6)
(0, 4)

Upvotes: 1

Views: 484

Answers (3)

burnpanck

Reputation: 2196

Most likely, the OP's own answer is the best possible algorithm, since it is O(n). However, the pure-Python overhead makes it very slow. That overhead is easily removed by compiling the algorithm with numba; with the current version (0.28.1 as of this writing), no manual type annotations are needed: simply decorating your function with @numba.njit() is enough.

However, if you do not want to depend on numba, here is a numpy algorithm that runs in O(n log n):

import numpy as np

def algo_burnpanck(sorted_array, thresh):
    # For each start index i, find the first index past the window:
    # the leftmost position where the array exceeds sorted_array[i] + thresh.
    limit = np.searchsorted(sorted_array, sorted_array + thresh, 'right')
    # Window length starting at each i; argmax returns the first
    # (i.e. smallest-i) maximum, which handles the tie-breaking rule.
    distance = limit - np.arange(limit.size)
    best = np.argmax(distance)
    return best, limit[best] - 1
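As a quick sanity check (mine, not part of the original answer), the function reproduces all of the question's expected outputs:

```python
import numpy as np

def algo_burnpanck(sorted_array, thresh):
    # For each start index i, first index past the window of values <= sorted_array[i] + thresh
    limit = np.searchsorted(sorted_array, sorted_array + thresh, 'right')
    distance = limit - np.arange(limit.size)
    best = np.argmax(distance)
    return best, limit[best] - 1

sorted_array = np.array([0, 0.7, 1, 2, 2.5])
cases = [(0.2, (0, 0)), (0.4, (1, 2)), (0.8, (0, 1)),
         (1, (0, 2)), (1.9, (1, 4)), (2, (0, 3)), (2.6, (0, 4))]
for thresh, expected in cases:
    assert algo_burnpanck(sorted_array, thresh) == expected
```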

I ran a quick benchmark on my own machine of the two other answers (the OP's and Divakar's), as well as my numpy algorithm and a numba-compiled version of the OP's algorithm.

thresh = 1
for n in [100, 10000]:
    sorted_array = np.sort(np.random.randn(n,))
    for f in [algo_user1475412,algo_Divakar,algo_burnpanck,algo_user1475412_numba]:
        a,b = f(sorted_array, thresh)
        d = b-a
        diff = sorted_array[b]-sorted_array[a]
        closestlonger = np.min(sorted_array[d+1:]-sorted_array[:-d-1])
        assert sorted_array[b]-sorted_array[a]<=thresh
        assert closestlonger>thresh
        print('f=%s, n=%d thresh=%s:'%(f.__name__,n,thresh))#,d,a,b,diff,closestlonger)
        %timeit f(sorted_array, thresh)

Here are the results:

f=algo_user1475412, n=100 thresh=1:
10000 loops, best of 3: 111 µs per loop
f=algo_Divakar, n=100 thresh=1:
10000 loops, best of 3: 74.6 µs per loop
f=algo_burnpanck, n=100 thresh=1:
100000 loops, best of 3: 9.38 µs per loop
f=algo_user1475412_numba, n=100 thresh=1:
1000000 loops, best of 3: 764 ns per loop
f=algo_user1475412, n=10000 thresh=1:
100 loops, best of 3: 12.1 ms per loop
f=algo_Divakar, n=10000 thresh=1:
1 loop, best of 3: 1.76 s per loop
f=algo_burnpanck, n=10000 thresh=1:
1000 loops, best of 3: 308 µs per loop
f=algo_user1475412_numba, n=10000 thresh=1:
10000 loops, best of 3: 82.9 µs per loop

At 100 numbers, the O(n^2) broadcasting solution just barely beats the O(n) pure-Python solution, but soon after, its scaling makes it useless. The O(n log n) solution keeps up even at 10000 numbers, but the numba approach is unbeaten everywhere.
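(A side note of mine, not part of the original answer:) the %timeit magic above only works inside IPython/Jupyter; in a plain script, a comparable measurement can be sketched with the standard timeit module, e.g. for the numpy algorithm:

```python
import timeit
import numpy as np

def algo_burnpanck(sorted_array, thresh):
    limit = np.searchsorted(sorted_array, sorted_array + thresh, 'right')
    distance = limit - np.arange(limit.size)
    best = np.argmax(distance)
    return best, limit[best] - 1

sorted_array = np.sort(np.random.randn(10_000))
# Average over 100 calls; number= controls how many calls are timed in one run.
t = timeit.timeit(lambda: algo_burnpanck(sorted_array, 1), number=100) / 100
print('algo_burnpanck: %.1f us per call' % (t * 1e6))
```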

Upvotes: 2

Divakar

Reputation: 221514

Here's a vectorized approach using broadcasting -

import numpy as np

def longest_thresh_subarray(sorted_array, thresh):
    # Pairwise differences: diffs[i, j] = sorted_array[i] - sorted_array[j]
    diffs = sorted_array[:, None] - sorted_array
    r = np.arange(sorted_array.size)
    # Only consider pairs with i > j
    valid_mask = r[:, None] > r
    mask = (diffs <= thresh) & valid_mask
    # Column (start index) with the most valid end indices
    bestcolID = mask.sum(0).argmax()
    idx = np.nonzero(mask[:, bestcolID])[0]
    if len(idx) == 0:
        out = (0, 0)
    else:
        out = idx[0] - 1, idx[-1]
    return out

Sample runs -

In [137]: sorted_array = np.array([0, 0.7, 1, 2, 2.5])

In [138]: longest_thresh_subarray(sorted_array,0.2)
Out[138]: (0, 0)

In [139]: longest_thresh_subarray(sorted_array,0.4)
Out[139]: (1, 2)

In [140]: longest_thresh_subarray(sorted_array,0.8)
Out[140]: (0, 1)

In [141]: longest_thresh_subarray(sorted_array,1)
Out[141]: (0, 2)

In [142]: longest_thresh_subarray(sorted_array,1.9)
Out[142]: (1, 4)

In [143]: longest_thresh_subarray(sorted_array,2)
Out[143]: (0, 3)

In [144]: longest_thresh_subarray(sorted_array,2.6)
Out[144]: (0, 4)
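One caveat worth noting (my observation, not part of the answer): the broadcasting step materialises an (n, n) matrix, so memory and work grow quadratically with the input size, which matches the benchmark collapse at n = 10000 reported in the other answer:

```python
import numpy as np

n = 1000
sorted_array = np.sort(np.random.randn(n))
# This single intermediate already holds n*n float64 values.
diffs = sorted_array[:, None] - sorted_array
assert diffs.shape == (n, n)
print(diffs.nbytes)  # 8_000_000 bytes at n = 1000; 800 MB already at n = 10000
```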

Upvotes: 0

user1475412

Reputation: 1799

Here's a simple loop-based solution in Python:

def longest_subarray_within_threshold(sorted_array, threshold):
    result = (0, 0)
    longest = 0
    i = j = 0
    end = len(sorted_array)
    while i < end:
        if j < end and sorted_array[j] - sorted_array[i] <= threshold:
            current_distance = j - i
            if current_distance > longest:
                longest = current_distance
                result = (i, j)
            j += 1
        else:
            i += 1
    return result
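For completeness, the two-pointer scan can be checked directly against the question's examples (the function is repeated here so the snippet runs standalone):

```python
def longest_subarray_within_threshold(sorted_array, threshold):
    # Two-pointer sweep: advance j while the window fits within the
    # threshold, otherwise advance i; track the longest window seen.
    result = (0, 0)
    longest = 0
    i = j = 0
    end = len(sorted_array)
    while i < end:
        if j < end and sorted_array[j] - sorted_array[i] <= threshold:
            current_distance = j - i
            if current_distance > longest:
                longest = current_distance
                result = (i, j)
            j += 1
        else:
            i += 1
    return result

sorted_array = [0, 0.7, 1, 2, 2.5]
cases = [(0.2, (0, 0)), (0.4, (1, 2)), (0.8, (0, 1)),
         (1, (0, 2)), (1.9, (1, 4)), (2, (0, 3)), (2.6, (0, 4))]
for threshold, expected in cases:
    assert longest_subarray_within_threshold(sorted_array, threshold) == expected
```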

Upvotes: 2
