Reputation: 11
How do I find the largest sequence in a timeseries.
For example I have a DataFrame
like this:
index Value
1-1-2012 10
1-2-2012 14
1-3-2012 15
1-4-2012 8
1-5-2012 7
1-6-2012 16
1-7-2012 17
1-8-2012 18
Now I want to get the longest sequence:
Here it would be the sequence from 1-6-2012
until 1-8-2012
with 3 entries.
Thanks Anja
Upvotes: 1
Views: 1542
Reputation: 803
This is a bit clunky but does the job. As you didn't specify the 'specific value' mentioned in the title, I choose 12.
import pandas as pd
time_indecies = pd.date_range(start='2012-01-01', end='2012-08-01', freq='MS')
data = [10, 14, 15, 8, 7, 16, 17, 18]
df = pd.DataFrame({'vals': data, 't_indices': time_indecies })
threshold = 12
df['tag'] = df.vals > threshold
# make another DF to hold info about each region
regs_above_thresh = pd.DataFrame()
# first row of consecutive region is a True preceded by a False in tags
regs_above_thresh['start_idx'] = \
df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]
# last row of consecutive region is a False preceded by a True
regs_above_thresh['end_idx'] = \
df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]
# how long is each region
regs_above_thresh['spans'] = \
[(spam[0] - spam[1] + 1) for spam in \
zip(regs_above_thresh['end_idx'], regs_above_thresh['start_idx'])]
# index of the region with the longest span
max_idx = regs_above_thresh['spans'].argmax()
# we can get the start and end points of longest region from the original dataframe
df.ix[regs_above_thresh.ix[max_idx][['start_idx', 'end_idx']].values]
The consecutive region cleverness is from behzad.nouri's solution here.
Upvotes: 2