Reputation: 18109
I have a Pandas Series
which contains acceleration timeseries data. My goal is to select slices of extreme force given some threshold. I was able to get part way with the following:
extremes = series.where(lambda force: abs(force - RESTING_FORCE) >= THRESHOLD, other=np.nan)
Now extremes
contains all values which exceed the threshold and NaN for any that don't, maintaining the original index.
However, a secondary requirement is that nearby peaks should be merged into a single event. Visually, you can picture the three extremes on the left (two high, one low) being joined into one complete segment and the two peaks on the right being joined into another complete segment.
I've read through the entire Series
reference but I'm having trouble finding methods to operate on my partial dataset. For example, if I had a method that returned an array of non-NaN index ranges, I would be able to sequentially compare each range and decide whether or not to fill in the space between with values from the original series (nearby) or leave them NaN (too far apart).
Perhaps I need to abandon the intermediate step and approach this from an entirely different angle? I'm new to Python so I'm having trouble getting very far with this. Any tips would be appreciated.
Upvotes: 2
Views: 828
Reputation: 6663
It actually wasn't so simple to come up with a vectorized solution without looping.
You'll probably need to go through the code step by step to see the actual outcome of each method but here is short sketch of the idea:
diff()
to get time deltas and apply time delta comparisonHere is the code with a dummy example:
%matplotlib inline
import pandas as pd
import numpy as np
size = 200
# create some dummy data
ts = pd.date_range(start="2017-10-28", freq="d", periods=size)
values = np.cumsum(np.random.normal(size=size)) + np.sin(np.linspace(0, 100, size))
series = pd.Series(values, index=ts, name="force")
series.plot(figsize=(10, 5))
# define thresholds
threshold_value = 6
threshold_time = pd.Timedelta(days=10)
# create data frame because we'll need helper columns
df = series.reset_index()
# get all initial peaks below or above threshold
mask = df["force"].abs().gt(threshold_value)
# create variable to store only timestamps of intial peaks
df.loc[mask, "ts_gap"] = df.loc[mask, "index"]
# create forward fill to enable comparison between current and next peak
df["ts_fill"] = df["ts_gap"].ffill()
# apply time delta comparison to filter only those within given time interval
df["within"] = df["ts_fill"].diff() < threshold_time
# convert boolean values into integers and
# create cummulative sum which creates group of consecutive timestamps
df["signals"] = (~df["within"]).astype(int).cumsum()
# create dataframe containing start and end values
df_signal = df.dropna(subset=["ts_gap"])\
.groupby("signals")["ts_gap"]\
.agg(["min", "max"])
# show results
df_signal
>>> min max
signals
10 2017-11-06 2017-11-27
11 2017-12-13 2018-01-22
12 2018-02-03 2018-02-23
Finally, show the plot:
series.plot(figsize=(10, 5))
for _, (idx_min, idx_max) in df_signal.iterrows():
series[idx_min:idx_max].plot()
As you can see in the plot, peaks greater an absolute value of 6 are merged into a single signal if their last and first timestamps are within a range of 10 days. The thresholds here are arbitrary just for illustration purpose. you can change them to whatever.
Upvotes: 3