Reputation: 4244
I have data arranged in the following form:
ID,DATE,STATUS
1,6/20/2011,A
1,1/14/2013,B
1,8/1/2016,C
2,3/1/2005,A
2,4/30/2005,B
2,6/30/2010,C
2,8/20/2010,D
I want to convert these transactions into an unbalanced panel with an annual frequency:
ID,YEAR,STATUS
1,2011,A
1,2012,A
1,2013,B
1,2014,B
1,2015,B
1,2016,C
2,2005,B
2,2006,B
2,2007,B
2,2008,B
2,2009,B
2,2010,D
So basically I want an annual series for each ID that spans the first through last date observed for that ID. The status in each year will be the last status observed in the year if there is more than one record for that year, or the last observed status if there is no date in that year.
This is a big dataset, so a good answer needs to use efficient methods provided by numpy/pandas.
Upvotes: 0
Views: 282
Reputation:
Here's one way:
import pandas as pd
df = pd.read_csv('file', parse_dates=['DATE'])
df = df.set_index('DATE').resample('A').ffill()
df['YEAR'] = df.index.year
df = df.sort_values(['ID', 'YEAR']).reset_index(drop=True)
df
Out:
ID STATUS YEAR
0 1 A 2011
1 1 A 2012
2 1 B 2013
3 1 B 2014
4 1 B 2015
5 1 C 2016
6 2 B 2005
7 2 B 2006
8 2 B 2007
9 2 B 2008
10 2 B 2009
11 2 D 2010
Upvotes: 2