Reputation: 1
I have an unbalanced Pandas MultiIndex DataFrame where each row stores a firm-year observation. The sample period (variable year) ranges from 2013 to 2017. The dataset includes the variable event, which is set to 1 if an event happens in a given year.
Sample dataset:
# Create dataset
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year': [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                            2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event': [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})
df.set_index(['id', 'year'], inplace=True)
df.sort_index(inplace=True)
I would like to create a new column status based on the existing column event, as follows: whenever the event happens for the first time in column event, the status column should change from 0 to 1 for all subsequent years (including the year the event happens).
DataFrame with the expected variable status:
         event  status
id year
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
I haven't found any useful solutions so far, so any advice would be much appreciated. Thanks!
Upvotes: 0
Views: 776
Reputation: 1249
A basic answer, with each step explained in the comments:
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year': [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                            2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event': [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})

# extract unique IDs as a list
ids = list(set(df["id"]))
# initialize a list to keep the results
list_event_years = []
# open a loop on the IDs
for id in ids:
    # the flag starts at 0 for each firm
    event_happened = 0
    # open a loop on the rows pertaining to the current ID
    for index, row in df[df["id"] == id].iterrows():
        # once the event happens, set the flag to 1 (it stays 1 for later years)
        if row["event"] == 1:
            event_happened = 1
        # add the flag for this firm-year to the list of results
        list_event_years.append(event_happened)
# add the list of results as a DataFrame column
df["event-happened"] = list_event_years
### OUTPUT
>>> df
    id  year  event  event-happened
0    1  2013      1               1
1    1  2014      0               1
2    1  2015      0               1
3    1  2016      0               1
4    1  2017      0               1
5    2  2014      0               0
6    2  2015      0               0
7    2  2016      1               1
8    2  2017      0               1
9    3  2016      1               1
10   3  2017      0               1
11   4  2013      0               0
12   4  2014      1               1
13   4  2015      0               1
14   5  2014      0               0
15   5  2015      0               0
16   5  2016      0               0
17   5  2017      1               1
and if you need the rows indexed as in your example, set the index afterwards:
df.set_index(['id', 'year'], inplace=True)
df.sort_index(inplace=True)
### OUTPUT
>>> df
         event  event-happened
id year
1  2013      1               1
   2014      0               1
   2015      0               1
   2016      0               1
   2017      0               1
2  2014      0               0
   2015      0               0
   2016      1               1
   2017      0               1
3  2016      1               1
   2017      0               1
4  2013      0               0
   2014      1               1
   2015      0               1
5  2014      0               0
   2015      0               0
   2016      0               0
   2017      1               1
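One caveat worth adding (my note, not part of the original answer): the loop assigns the collected flags back positionally, so it assumes the rows are grouped by id, sorted by year within each id, and that the ids are visited in that same order; the sample data happens to satisfy all of this. If your real data might not, a minimal sketch of a guard before the loop, assuming numeric ids:
# sort rows by firm and year so the flag propagates forward in time,
# and visit the ids in the same order as the sorted rows
df = df.sort_values(['id', 'year']).reset_index(drop=True)
ids = sorted(df['id'].unique())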
Upvotes: 0
Reputation: 42946
We can groupby on the first level of your index (id) and mark all the rows where event is eq to 1. Then use cumsum, which treats True as 1 and False as 0, so the running total is 0 before the first event and 1 from the event year onward:
df['status'] = df.groupby(level=0).apply(lambda x: x.eq(1).cumsum())
Output
         event  status
id year
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
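A small caveat (my addition, not the answerer's): cumsum produces a running count of events, which stays at 0/1 here only because each id has at most one event in the sample data. If an id could record several events and you still want a strict 0/1 status flag, a cummax-based sketch would do it:
# assumes you want a 0/1 flag rather than a running event count
df['status'] = df['event'].eq(1).astype(int).groupby(level=0).cummax()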
Upvotes: 3
Reputation: 2757
The key is to use cumsum within a groupby:
df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year': [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                            2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event': [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})

(df.assign(status=lambda x: x.event.eq(1).mul(1).groupby(x['id']).cumsum())
   .set_index(['id', 'year']))
Output
         event  status
id year
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
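A note on the chaining (my reading of it, not the answerer's wording): x.event.eq(1).mul(1) turns the boolean mask into 0/1 integers before the grouped cumsum, and the groupby is passed x['id'] as a Series because the index is only set at the end of the chain. As far as I can tell, .astype(int) would work the same way as .mul(1):
# hypothetical equivalent spelling of the same chain
(df.assign(status=lambda x: x.event.eq(1).astype(int).groupby(x['id']).cumsum())
   .set_index(['id', 'year']))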
Upvotes: 0