Selecting slices of a Pandas Series based on both index and value conditions

Question

I have a Pandas Series which contains acceleration timeseries data. My goal is to select slices of extreme force given some threshold. I was able to get part way with the following:

extremes = series.where(lambda force: abs(force - RESTING_FORCE) >= THRESHOLD, other=np.nan)

Now extremes contains all values which exceed the threshold and NaN for any that don't, maintaining the original index.

However, a secondary requirement is that nearby peaks should be merged into a single event. Visually, you can picture the three extremes on the left (two high, one low) being joined into one complete segment and the two peaks on the right being joined into another complete segment.

I've read through the entire Series reference but I'm having trouble finding methods to operate on my partial dataset. For example, if I had a method that returned an array of non-NaN index ranges, I would be able to sequentially compare each range and decide whether or not to fill in the space between with values from the original series (nearby) or leave them NaN (too far apart).

Perhaps I need to abandon the intermediate step and approach this from an entirely different angle? I'm new to Python so I'm having trouble getting very far with this. Any tips would be appreciated.

pansen · Accepted Answer

It actually wasn't so simple to come up with a vectorized solution without looping.

You'll probably need to go through the code step by step to see the actual outcome of each method but here is short sketch of the idea:

Solution outline

Identify all peaks via simple threshold filter
Get timestamps of peak values into a column and forward fill gaps in between in order to allow to compare current valid timestamp with previous valid timestamp
Do actual comparison via diff() to get time deltas and apply time delta comparison
Convert booleans to integers to use cummulative sum to create signal groups
Group by signals and get min and max timestamp values

Example data

Here is the code with a dummy example:

%matplotlib inline

import pandas as pd
import numpy as np

size = 200

# create some dummy data
ts = pd.date_range(start="2017-10-28", freq="d", periods=size)
values = np.cumsum(np.random.normal(size=size)) + np.sin(np.linspace(0, 100, size))
series = pd.Series(values, index=ts, name="force")
series.plot(figsize=(10, 5))

Solution code

# define thresholds
threshold_value = 6
threshold_time = pd.Timedelta(days=10)

# create data frame because we'll need helper columns
df = series.reset_index()

# get all initial peaks below or above threshold
mask = df["force"].abs().gt(threshold_value)

# create variable to store only timestamps of intial peaks
df.loc[mask, "ts_gap"] = df.loc[mask, "index"]

# create forward fill to enable comparison between current and next peak
df["ts_fill"] = df["ts_gap"].ffill()

# apply time delta comparison to filter only those within given time interval
df["within"] = df["ts_fill"].diff() < threshold_time

# convert boolean values into integers and 
# create cummulative sum which creates group of consecutive timestamps
df["signals"] = (~df["within"]).astype(int).cumsum()

# create dataframe containing start and end values
df_signal = df.dropna(subset=["ts_gap"])\
              .groupby("signals")["ts_gap"]\
              .agg(["min", "max"])

# show results
df_signal

>>>       min           max
signals         
10        2017-11-06    2017-11-27
11        2017-12-13    2018-01-22
12        2018-02-03    2018-02-23

Finally, show the plot:

series.plot(figsize=(10, 5))

for _, (idx_min, idx_max) in df_signal.iterrows():
    series[idx_min:idx_max].plot()

Result

As you can see in the plot, peaks greater an absolute value of 6 are merged into a single signal if their last and first timestamps are within a range of 10 days. The thresholds here are arbitrary just for illustration purpose. you can change them to whatever.

Selecting slices of a Pandas Series based on both index and value conditions

Answers (1)

Solution outline

Example data

Solution code

Result

Related Questions