Jonathan Bechtel

Reputation: 3607

Finding the beginning and end dates of when a sequence of values occurs in Pandas

I have a dataframe with a date index and another column that marks whether or not an event occurred on that day with a 1 or 0.

If an event occurred, it typically happened continuously for a prolonged period. The column marks whether or not a recession occurred, so there would likely be 60-180 straight days marked with a 1 before it goes back to 0.

What I need to do is find the dates that mark the beginning and end of each sequence of 1's.

Here's some quick sample code:

import numpy as np
import pandas as pd

dates = pd.date_range(start='2010-01-01', end='2015-01-01')
nums = np.random.normal(50, 5, 1827)
df = pd.DataFrame(nums, index=dates, columns=['Nums'])
df['Recession'] = np.where((df.index.month == 3) | (df.index.month == 12), 1, 0)

With the example dataframe, the value 1 occurs for the months of March and December, so ideally I'd have a list that reads [2010-03-01, 2010-03-31, 2010-12-01, 2010-12-31, ......, 2014-12-01, 2014-12-31].

I know I could find these values by using a for-loop, but that seems inefficient. I tried using groupby as well, but couldn't find anything that gave the results that I wanted.

I'm not sure if there's a pandas or numpy method to search an index for the appropriate conditions.

Upvotes: 0

Views: 1110

Answers (3)

sushanth

Reputation: 8302

Let's try this, using DataFrameGroupBy.idxmin + DataFrameGroupBy.idxmax to pick out the first and last date in each (month, year) group:

# reset the column so only the sequence endpoints end up flagged
df['Recession'] = np.nan

# group by (month, year) & aggregate on the day of the month
g = (
    df.assign(day=df.index.day)
      .groupby([df.index.month, df.index.year]).day
)

# mask of the earliest & latest date in each (month, year) group
mask = df.index.isin(g.idxmin()) | df.index.isin(g.idxmax())

# apply the mask, restricted to March and December
df.loc[mask & (df.index.month.isin([3, 12])), 'Recession'] = 1

print(df[df['Recession'] == 1])

                 Nums  Recession
2010-03-01  45.698168        1.0
2010-03-31  47.969167        1.0
2010-12-01  49.388595        1.0
2010-12-31  46.689064        1.0
2011-03-01  50.120603        1.0
2011-03-31  58.379980        1.0
2011-12-01  53.745407        1.0
...
...
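
A small usage sketch, assuming the flagged column above, to collect the endpoints as a plain list of dates:

# the flagged first/last dates of each March and December
endpoints = df[df['Recession'] == 1].index.tolist()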

Upvotes: 1

Umar.H

Reputation: 23099

From what I understand you need to find the first value in each sequence? If so, we can use groupby with cumsum to label each consecutive run, and cumcount to number the rows within each run.

df["keyGroup"] = (
    df.groupby(df["Recession"].ne(df["Recession"].shift()).cumsum()).cumcount() + 1
)

df[df['keyGroup'].eq(1)]

                 Nums  Recession  keyGroup
2010-01-01  51.944742          0         1
2010-03-01  54.809271          1         1
2010-04-01  52.632831          0         1
2010-12-01  55.863695          1         1
2011-01-01  52.944778          0         1
2011-03-01  58.164943          1         1
2011-04-01  49.590640          0         1
2011-12-01  47.884919          1         1
2012-01-01  44.128065          0         1
2012-03-01  54.846231          1         1
2012-04-01  51.312064          0         1
2012-12-01  46.091171          1         1
2013-01-01  49.287102          0         1
2013-03-01  54.727874          1         1
2013-04-01  53.163730          0         1
2013-12-01  42.373602          1         1
2014-01-01  43.822791          0         1
2014-03-01  51.203125          1         1
2014-04-01  54.322415          0         1
2014-12-01  44.052536          1         1
2015-01-01  53.438015          0         1

You can call .index to get the dates as a DatetimeIndex.

df[df['keyGroup'].eq(1)].index


DatetimeIndex(['2010-01-01', '2010-03-01', '2010-04-01', '2010-12-01',
               '2011-01-01', '2011-03-01', '2011-04-01', '2011-12-01',
               '2012-01-01', '2012-03-01', '2012-04-01', '2012-12-01',
               '2013-01-01', '2013-03-01', '2013-04-01', '2013-12-01',
               '2014-01-01', '2014-03-01', '2014-04-01', '2014-12-01',
               '2015-01-01'],
              dtype='datetime64[ns]', name='date', freq=None)
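
The keyGroup trick above only pulls out the first date of each run. A minimal sketch, assuming the same df from the question, that extends the run-labelling idea to return both the first and the last date of each run of 1's:

# label consecutive runs of equal Recession values
run_id = df["Recession"].ne(df["Recession"].shift()).cumsum()

# restrict to recession days, then take each run's first and last date
mask = df["Recession"].eq(1)
bounds = df.index.to_series()[mask].groupby(run_id[mask]).agg(["min", "max"])

print(bounds)  # one row per recession: 'min' is the start date, 'max' the end date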

Upvotes: 0

EddyG

Reputation: 685

I would use diff to find the periods: diff shows where the series switches from one state to the other, so you can split the indices found into two parts, the starts and the ends.

Depending on whether the data starts with a recession or not:

# positions where Recession switches state (0 -> 1 or 1 -> 0)
locs = (df.Recession.diff().fillna(0) != 0).values.nonzero()[0]

if df.Recession.iloc[0] == 0:
    # series opens outside a recession: changes alternate start, end, start, ...
    start = df.index[locs[::2]]
    end = df.index[locs[1::2] - 1]
else:
    # series opens inside a recession: the first change detected is an end
    end = df.index[locs[::2] - 1]
    start = df.index[locs[1::2]]

If the data starts in a recession already, it's up to you whether to include the very first date as a start; the code above does not include it.
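
A minimal alternative sketch of the same diff idea, assuming the df from the question: padding the series with a zero at each end avoids the branching, because a recession touching either edge of the data still produces both a rising and a falling edge:

import numpy as np

# pad with a zero on each side so edge-touching recessions still
# produce both a rising (0 -> 1) and a falling (1 -> 0) edge
padded = np.concatenate(([0], df['Recession'].to_numpy(), [0]))
change = np.diff(padded)

starts = df.index[np.flatnonzero(change == 1)]       # first day of each run of 1's
ends = df.index[np.flatnonzero(change == -1) - 1]    # last day of each run of 1's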

Upvotes: 0
