Reputation: 125
I am writing an algorithm to determine the intervals of the "mountains" on a density plot. The plot is taken from the depths from a Kinect if anyone is interested. Here is a quick visual example of what this algorithm finds: (with the small mountains removed):
My current algorithm:
def find_peak_intervals(data):
previous = 0
peak = False
ranges = []
begin_range = 0
end_range = 0
for current in xrange(len(data)):
if (not peak) and ((data[current] - data[previous]) > 0):
peak = True
begin_range = current
if peak and (data[current] == 0):
peak = False
end_range = current
ranges.append((begin_range, end_range))
previous = current
return np.array(ranges)
The function works but it takes nearly 3 milliseconds on my laptop, and I need to be able to run my entire program at at least 30 frames per second. This function is rather ugly and I have to run it 3 times per frame for my program, so I would like any hints as to how to simplify and optimize this function (maybe something from numpy or scipy that I missed).
Upvotes: 0
Views: 1164
Reputation: 14179
Assuming a pandas
dataframe like so:
Value
0 0
1 3
2 2
3 2
4 1
5 2
6 3
7 0
8 1
9 3
10 0
11 0
12 0
13 1
14 0
15 3
16 2
17 3
18 1
19 0
You can get the contiguous non-zero ranges by using df["Value"].shift(x)
where x
could either be 1
or -1
so you can check if it's bounded by zeroes. Once you get the boundaries, you can just store their index pairs and use them later on when filtering the data.
The following code is based on the excellent answer here by @behzad.nouri.
import pandas as pd
df = pd.read_csv("data.csv")
# Or you can use df = pd.DataFrame.from_dict({'Value': {0: 0, 1: 3, 2: 2, 3: 2, 4: 1, 5: 2, 6: 3, 7: 0, 8: 1, 9: 3, 10: 0, 11: 0, 12: 0, 13: 1, 14: 0, 15: 3, 16: 2, 17: 3, 18: 1, 19: 0}})
# --
# from https://stackoverflow.com/questions/24281936
# credits to @behzad.nouri
df['tag'] = df['Value'] > 0
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]
pr = [(i, j) for i, j in zip(fst, lst)]
# --
for i, j in pr:
print df.loc[i:j, "Value"]
This gives the result:
1 3
2 2
3 2
4 1
5 2
6 3
Name: Value, dtype: int64
8 1
9 3
Name: Value, dtype: int64
13 1
Name: Value, dtype: int64
15 3
16 2
17 3
18 1
Name: Value, dtype: int64
Timing it in IPython gives the following:
%timeit find_peak_intervals(df)
1000 loops, best of 3: 1.49 ms per loop
This is not too far from your attempt speed-wise. An alternative is to use convert the pandas
series to a numpy
array and operate from there. Let's take another excellent answer, this one by @Warren Weckesser, and modify it to suit your needs. Let's time it as well.
In [22]: np_arr = np.array(df["Value"])
In [23]: def greater_than_zero(a):
...: isntzero = np.concatenate(([0], np.greater(a, 0).view(np.int8), [0]))
...: absdiff = np.abs(np.diff(isntzero))
...: ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
...: return ranges
In [24]: %timeit greater_than_zero(np_arr)
100000 loops, best of 3: 17.1 µs per loop
Not so bad at 17.1 microseconds, and it gives the same ranges as well.
[1 7] # Basically same as indices 1-6 in pandas.
[ 8 10] # 8, 9
[13 14] # 13, 13
[15 19] # 15, 18
Upvotes: 3