dlm
dlm

Reputation: 4244

[Numpy/Pandas]How can I efficiently create a panel data set from transaction records?

I have data arranged in the following form:

ID,DATE,STATUS
1,6/20/2011,A
1,1/14/2013,B
1,8/1/2016,C
2,3/1/2005,A
2,4/30/2005,B
2,6/30/2010,C
2,8/20/2010,D

I want to convert these transactions into an unbalanced panel with an annual frequency:

ID,YEAR,STATUS
1,2011,A
1,2012,A
1,2013,B
1,2014,B
1,2015,B
1,2016,C
2,2005,B
2,2006,B
2,2007,B
2,2008,B
2,2009,B
2,2010,D

So basically I want an annual series for each ID that spans the first through last date observed for that ID. The status in each year will be the last status observed in the year if there is more than one record for that year, or the last observed status if there is no date in that year.

This is a big dataset, so a good answer needs to use efficient methods provided by numpy/pandas.

Upvotes: 0

Views: 282

Answers (1)

user2285236
user2285236

Reputation:

Here's one way:

import pandas as pd
df = pd.read_csv('file', parse_dates=['DATE'])
df = df.set_index('DATE').resample('A').ffill()
df['YEAR'] = df.index.year
df = df.sort_values(['ID', 'YEAR']).reset_index(drop=True)
df
Out: 
    ID STATUS  YEAR
0    1      A  2011
1    1      A  2012
2    1      B  2013
3    1      B  2014
4    1      B  2015
5    1      C  2016
6    2      B  2005
7    2      B  2006
8    2      B  2007
9    2      B  2008
10   2      B  2009
11   2      D  2010

Upvotes: 2

Related Questions