Alex Shmakov

Reputation: 125

Quickly Find Non-Zero Intervals

I am writing an algorithm to determine the intervals of the "mountains" (contiguous non-zero runs) on a density plot. The plot is built from depth data from a Kinect, if anyone is interested. Here is a quick visual example of what this algorithm finds (with the small mountains removed):

Example image

My current algorithm:

import numpy as np

def find_peak_intervals(data):
    previous = 0
    peak = False
    ranges = []
    begin_range = 0
    end_range = 0

    for current in range(len(data)):
        # Rising edge: the value increased from the previous sample.
        if (not peak) and ((data[current] - data[previous]) > 0):
            peak = True
            begin_range = current

        # Falling back to zero closes the current interval.
        if peak and (data[current] == 0):
            peak = False
            end_range = current
            ranges.append((begin_range, end_range))

        previous = current

    return np.array(ranges)
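For a quick sanity check, here is the same function run on a made-up toy array (the function is repeated with its import so the snippet runs on its own; the sample data is illustrative only):

```python
import numpy as np

def find_peak_intervals(data):
    previous = 0
    peak = False
    ranges = []
    begin_range = 0
    end_range = 0
    for current in range(len(data)):
        # Rising edge: the value increased from the previous sample.
        if (not peak) and ((data[current] - data[previous]) > 0):
            peak = True
            begin_range = current
        # Falling back to zero closes the current interval.
        if peak and (data[current] == 0):
            peak = False
            end_range = current
            ranges.append((begin_range, end_range))
        previous = current
    return np.array(ranges)

# Made-up sample: three non-zero "mountains" separated by zeros.
data = [0, 3, 2, 2, 1, 2, 3, 0, 1, 3, 0, 0, 0, 1, 0]
print(find_peak_intervals(data))  # pairs (1, 7), (8, 10), (13, 14)
```

Note that an interval is only appended when the data returns to zero, so a trailing non-zero run at the end of the array is silently dropped.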

The function works, but it takes nearly 3 milliseconds on my laptop, and I need my entire program to run at 30 frames per second or better. The function is rather ugly, and I have to run it three times per frame, so I would appreciate any hints on simplifying and optimizing it (maybe something from numpy or scipy that I missed).

Upvotes: 0

Views: 1164

Answers (1)

WGS

Reputation: 14179

Assuming a pandas dataframe like so:

    Value
0       0
1       3
2       2
3       2
4       1
5       2
6       3
7       0
8       1
9       3
10      0
11      0
12      0
13      1
14      0
15      3
16      2
17      3
18      1
19      0

You can get the contiguous non-zero ranges by using df["Value"].shift(x), where x is either 1 or -1, to check whether each run is bounded by zeroes. Once you have the boundaries, you can store their index pairs and use them later when filtering the data.
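To make the shift trick concrete, here is a minimal sketch on a tiny made-up series (it uses shift's fill_value argument to keep the boolean dtype at the edges):

```python
import pandas as pd

s = pd.Series([0, 3, 2, 0, 1, 0])
tag = s > 0

# A value starts a run when it is non-zero and its predecessor is not.
starts = tag & ~tag.shift(1, fill_value=False)
# A value ends a run when it is non-zero and its successor is not.
ends = tag & ~tag.shift(-1, fill_value=False)

print(list(s.index[starts]))  # [1, 4]
print(list(s.index[ends]))    # [2, 4]
```

Zipping the start and end indices then gives the inclusive boundary pairs (1, 2) and (4, 4).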

The following code is based on an excellent answer by @behzad.nouri; the link is preserved in the comment below.

import pandas as pd

df = pd.read_csv("data.csv")
# Or you can use df = pd.DataFrame.from_dict({'Value': {0: 0, 1: 3, 2: 2, 3: 2, 4: 1, 5: 2, 6: 3, 7: 0, 8: 1, 9: 3, 10: 0, 11: 0, 12: 0, 13: 1, 14: 0, 15: 3, 16: 2, 17: 3, 18: 1, 19: 0}})
# --
# from https://stackoverflow.com/questions/24281936
# credits to @behzad.nouri
df['tag'] = df['Value'] > 0
fst = df.index[df['tag'] & ~df['tag'].shift(1, fill_value=False)]
lst = df.index[df['tag'] & ~df['tag'].shift(-1, fill_value=False)]
pr = list(zip(fst, lst))
# --

for i, j in pr:
    print(df.loc[i:j, "Value"])

This gives the result:

1    3
2    2
3    2
4    1
5    2
6    3
Name: Value, dtype: int64
8    1
9    3
Name: Value, dtype: int64
13    1
Name: Value, dtype: int64
15    3
16    2
17    3
18    1
Name: Value, dtype: int64

Timing it in IPython (with the pandas code above wrapped in a function reusing your find_peak_intervals name) gives the following:

%timeit find_peak_intervals(df)
1000 loops, best of 3: 1.49 ms per loop

This is not too far from your attempt speed-wise. An alternative is to convert the pandas series to a numpy array and work from there. Let's take another excellent answer, this one by @Warren Weckesser, modify it to suit your needs, and time it as well.

In [22]: np_arr = np.array(df["Value"])

In [23]: def greater_than_zero(a):
    ...:     isntzero = np.concatenate(([0], np.greater(a, 0).view(np.int8), [0]))
    ...:     absdiff = np.abs(np.diff(isntzero))
    ...:     ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    ...:     return ranges

In [24]: %timeit greater_than_zero(np_arr)
100000 loops, best of 3: 17.1 µs per loop

Not bad at 17.1 microseconds, and it gives the same ranges, this time as half-open [start, stop) pairs:

[1 7] # Basically same as indices 1-6 in pandas.
[ 8 10] # 8, 9
[13 14] # 13, 13
[15 19] # 15, 18
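Because each pair is a half-open [start, stop) range, it can be used directly to slice the original array. A minimal sketch (repeating the helper so the snippet runs standalone, on the same sample values):

```python
import numpy as np

def greater_than_zero(a):
    # Pad with zeros so runs touching either edge are closed off.
    isntzero = np.concatenate(([0], np.greater(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(isntzero))
    # Edges of the 0/1 mask come in (start, stop) pairs; stop is exclusive.
    return np.where(absdiff == 1)[0].reshape(-1, 2)

a = np.array([0, 3, 2, 2, 1, 2, 3, 0, 1, 3, 0, 0, 0, 1, 0, 3, 2, 3, 1, 0])

# Each (start, stop) pair slices out one non-zero run.
for start, stop in greater_than_zero(a):
    print(a[start:stop])
```

The first slice, a[1:7], recovers the run [3 2 2 1 2 3], matching the first pandas group above.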

Upvotes: 3
