Reputation: 33
I have a dataframe like this
df = pd.DataFrame({'grade': ['A','C','B'], 'year': [2018,2015,2017], 'label': [1,2,3]})
grade year label
0 A 2018 1
1 C 2015 2
2 B 2017 3
I want to expand the dataframe based on year column (the most recent year for each label). Basically, for each label, generate 4 more rows to cover recent 5 years in total.
Expected output:
print(df_expanded)
grade year label
0 A 2018 1
1 A 2017 1
2 A 2016 1
3 A 2015 1
4 A 2014 1
5 C 2015 2
6 C 2014 2
7 C 2013 2
8 C 2012 2
9 C 2011 2
10 B 2017 3
11 B 2016 3
12 B 2015 3
13 B 2014 3
14 B 2013 3
What I have tried:
for lab in df['label'].unique():
grp = df.loc[(df['label']==lab)]
yr = grp['year'].iloc[0]
df_year = pd.DataFrame({'year': list(reversed(range(yr-4,yr+1)))})
df_merged = pd.merge(grp, df_year, how='outer', left_on=['year'], right_on=['year'])
df_merged = df_merged.fillna(method='ffill')
df_expanded=pd.concat([df_expanded,df_merged],axis=0)
df_expanded = df_expanded.reset_index(drop=True)
df_expanded['label'] = df_expanded['label'].astype(int)
My "for loop" method works. However, it's running extremely slow on my actual dataset, which contains roughly 30000 label. I'm wondering there must be some better way to do so. Thanks a lot!
Upvotes: 3
Views: 93
Reputation: 59549
Series.repeat
Since label
is repeated the same # per original row, then subtract off the cumcount to get the years.
df = df.set_index(['grade', 'year'])['label'].repeat(5).reset_index()
df['year'] = df.year - df.groupby('grade').cumcount()
grade year label
0 A 2018 1
1 A 2017 1
2 A 2016 1
3 A 2015 1
4 A 2014 1
5 C 2015 2
6 C 2014 2
7 C 2013 2
8 C 2012 2
9 C 2011 2
10 B 2017 3
11 B 2016 3
12 B 2015 3
13 B 2014 3
14 B 2013 3
Upvotes: 1
Reputation: 42916
When you have multiple grade
values, we are safer using GroupBy
:
df = (
df.groupby('grade', sort=False)
.apply(lambda x: x.set_index('year')
.reindex(np.arange(x.year.max(), x.year.max()-5, step=-1)))
.ffill()
.reset_index(level=1)
.reset_index(drop=True)
)
year grade label
0 2018 A 1.0
1 2017 A 1.0
2 2016 A 1.0
3 2015 A 1.0
4 2014 A 1.0
5 2015 C 2.0
6 2014 C 2.0
7 2013 C 2.0
8 2012 C 2.0
9 2011 C 2.0
10 2017 B 3.0
11 2016 B 3.0
12 2015 B 3.0
13 2014 B 3.0
14 2013 B 3.0
Upvotes: 2
Reputation: 294278
pd.DataFrame(
[
(g, y, l) for g, Y, l in zip(*map(df.get, df))
for y in range(Y, Y - 5, -1)
],
columns=df.columns
)
grade year label
0 A 2018 1
1 A 2017 1
2 A 2016 1
3 A 2015 1
4 A 2014 1
5 C 2015 2
6 C 2014 2
7 C 2013 2
8 C 2012 2
9 C 2011 2
10 B 2017 3
11 B 2016 3
12 B 2015 3
13 B 2014 3
14 B 2013 3
explode
df.assign(year=[range(y, y - 5, -1) for y in df.year]).explode('year')
grade year label
0 A 2018 1
0 A 2017 1
0 A 2016 1
0 A 2015 1
0 A 2014 1
1 C 2015 2
1 C 2014 2
1 C 2013 2
1 C 2012 2
1 C 2011 2
2 B 2017 3
2 B 2016 3
2 B 2015 3
2 B 2014 3
2 B 2013 3
Upvotes: 3
Reputation: 150745
You can try:
(pd.concat(df.assign(year=df['year'].sub(i)) for i in range(5))
.sort_index()
.reset_index(drop=True)
)
Output:
grade year label
0 A 2018 1
1 A 2017 1
2 A 2016 1
3 A 2015 1
4 A 2014 1
5 C 2015 2
6 C 2014 2
7 C 2013 2
8 C 2012 2
9 C 2011 2
10 B 2017 3
11 B 2016 3
12 B 2015 3
13 B 2014 3
14 B 2013 3
Upvotes: 4