Reputation: 3607
I have a dataframe with an index column and another column that marks whether or not an event occurred on that day with a 1 or 0.
If an event occurred it typically happened continuously for a prolonged period of time. They'll typically mark whether or not a recession occurred, so it'd likely be 60-180 straight days that would be marked with a 1 before going to 0 again.
What I need to do is find the dates that mark the beginning and end of each sequence of 1's.
Here's some quick sample code:
dates = pd.date_range(start='2010-01-01', end='2015-01-01')
nums = np.random.normal(50, 5, 1827)
df = pd.DataFrame(nums, index=dates, columns=['Nums'])
df['Recession'] = np.where((df.index.month == 3) | (df.index.month == 12), 1, 0)
With the example dataframe, the value 1 occurs for the months of March and December, so ideally I'd have a list that reads [2010-03-01, 2010-03-31, 2010-12-01, 2010-12-30, ......, 2015-12-01, 2015-12-30]
.
I know I could find these values by using a for-loop, but that seems inefficient. I tried using groupby
as well, but couldn't find anything that gave the results that I wanted.
Not sure if there's a pandas or numpy method to search an index for the appropriate conditions or not.
Upvotes: 0
Views: 1110
Reputation: 8302
Let's try this, using DataFrameGroupBy.idxmin
+ DataFrameGroupBy.idxmax
# group-by on month, year & aggregate on date
g = (
df.assign(day=df.index.day)
.groupby([df.index.month, df.index.year]).day
)
# create mask of max date & min date for each (month, year) combination
mask = df.index.isin(g.idxmin()) | df.index.isin(g.idxmax())
# apply previous mask with month filter..
df.loc[mask & (df.index.month.isin([3,12])), 'Recession'] = 1
print(df[df['Recession'] == 1])
Nums Recession
2010-03-01 45.698168 1.0
2010-03-31 47.969167 1.0
2010-12-01 49.388595 1.0
2010-12-31 46.689064 1.0
2011-03-01 50.120603 1.0
2011-03-31 58.379980 1.0
2011-12-01 53.745407 1.0
...
...
Upvotes: 1
Reputation: 23099
From what I understand you need to find the first value in a sequence? if so we can use groupby
and cumsum
to sum each consecutive group, and cumcount
to count each of the groups.
df["keyGroup"] = (
df.groupby(df["Recession"].ne(df["Recession"].shift()).cumsum()).cumcount() + 1
)
df[df['keyGroup'].eq(1)]
Nums Recession keyGroup
2010-01-01 51.944742 0 1
2010-03-01 54.809271 1 1
2010-04-01 52.632831 0 1
2010-12-01 55.863695 1 1
2011-01-01 52.944778 0 1
2011-03-01 58.164943 1 1
2011-04-01 49.590640 0 1
2011-12-01 47.884919 1 1
2012-01-01 44.128065 0 1
2012-03-01 54.846231 1 1
2012-04-01 51.312064 0 1
2012-12-01 46.091171 1 1
2013-01-01 49.287102 0 1
2013-03-01 54.727874 1 1
2013-04-01 53.163730 0 1
2013-12-01 42.373602 1 1
2014-01-01 43.822791 0 1
2014-03-01 51.203125 1 1
2014-04-01 54.322415 0 1
2014-12-01 44.052536 1 1
2015-01-01 53.438015 0 1
you can call .index
to get the values in a list.
df[df['keyGroup'].eq(1)].index
DatetimeIndex(['2010-01-01', '2010-03-01', '2010-04-01', '2010-12-01',
'2011-01-01', '2011-03-01', '2011-04-01', '2011-12-01',
'2012-01-01', '2012-03-01', '2012-04-01', '2012-12-01',
'2013-01-01', '2013-03-01', '2013-04-01', '2013-12-01',
'2014-01-01', '2014-03-01', '2014-04-01', '2014-12-01',
'2015-01-01'],
dtype='datetime64[ns]', name='date', freq=None)
Upvotes: 0
Reputation: 685
I would use diff to find the periods, the diff enables to find when it switches from one state to another, so split the indices found in two parts, the starts and ends.
Depending whether the data starts with a recession or not:
locs = (df.Recession.diff().fillna(0)!=0).values.nonzero()[0]
if df.Recession.iloc[0]==0:
start = df.index[locs[::2]]
end = df.index[locs[1::2]-1]
else:
start = df.index[locs[::2]-1]
end = df.index[locs[1::2]]
If the data started with a recession already, up to you if you want to include the first date as a start or not, the above does not include it.
Upvotes: 0