hide on bush
hide on bush

Reputation: 33

How to expand/add rows by year?

I have a dataframe like this

df = pd.DataFrame({'grade': ['A','C','B'], 'year': [2018,2015,2017], 'label': [1,2,3]})

  grade  year  label
0     A  2018      1
1     C  2015      2
2     B  2017      3

I want to expand the dataframe based on year column (the most recent year for each label). Basically, for each label, generate 4 more rows to cover recent 5 years in total.

Expected output:

print(df_expanded)

   grade  year  label
0      A  2018      1
1      A  2017      1
2      A  2016      1
3      A  2015      1
4      A  2014      1
5      C  2015      2
6      C  2014      2
7      C  2013      2
8      C  2012      2
9      C  2011      2
10     B  2017      3
11     B  2016      3
12     B  2015      3
13     B  2014      3
14     B  2013      3

What I have tried:

for lab in df['label'].unique():
    grp = df.loc[(df['label']==lab)]   
    yr = grp['year'].iloc[0]
    df_year = pd.DataFrame({'year': list(reversed(range(yr-4,yr+1)))})
    df_merged = pd.merge(grp, df_year, how='outer', left_on=['year'], right_on=['year'])
    df_merged = df_merged.fillna(method='ffill')
    df_expanded=pd.concat([df_expanded,df_merged],axis=0)

df_expanded = df_expanded.reset_index(drop=True)
df_expanded['label'] = df_expanded['label'].astype(int)

My "for loop" method works. However, it's running extremely slow on my actual dataset, which contains roughly 30000 label. I'm wondering there must be some better way to do so. Thanks a lot!

Upvotes: 3

Views: 93

Answers (4)

ALollz
ALollz

Reputation: 59549

Series.repeat

Since label is repeated the same # per original row, then subtract off the cumcount to get the years.

df = df.set_index(['grade', 'year'])['label'].repeat(5).reset_index()
df['year'] = df.year - df.groupby('grade').cumcount()

   grade  year  label
0      A  2018      1
1      A  2017      1
2      A  2016      1
3      A  2015      1
4      A  2014      1
5      C  2015      2
6      C  2014      2
7      C  2013      2
8      C  2012      2
9      C  2011      2
10     B  2017      3
11     B  2016      3
12     B  2015      3
13     B  2014      3
14     B  2013      3

Upvotes: 1

Erfan
Erfan

Reputation: 42916

When you have multiple grade values, we are safer using GroupBy:

df = (
    df.groupby('grade', sort=False)
       .apply(lambda x: x.set_index('year')
                         .reindex(np.arange(x.year.max(), x.year.max()-5, step=-1)))
       .ffill()
       .reset_index(level=1)
       .reset_index(drop=True)
)

    year grade  label
0   2018  A     1.0  
1   2017  A     1.0  
2   2016  A     1.0  
3   2015  A     1.0  
4   2014  A     1.0  
5   2015  C     2.0  
6   2014  C     2.0  
7   2013  C     2.0  
8   2012  C     2.0  
9   2011  C     2.0  
10  2017  B     3.0  
11  2016  B     3.0  
12  2015  B     3.0  
13  2014  B     3.0  
14  2013  B     3.0  

Upvotes: 2

piRSquared
piRSquared

Reputation: 294278

comprehension

pd.DataFrame(
    [
        (g, y, l) for g, Y, l in zip(*map(df.get, df))
                  for y in range(Y, Y - 5, -1)
    ],
    columns=df.columns
)

   grade  year  label
0      A  2018      1
1      A  2017      1
2      A  2016      1
3      A  2015      1
4      A  2014      1
5      C  2015      2
6      C  2014      2
7      C  2013      2
8      C  2012      2
9      C  2011      2
10     B  2017      3
11     B  2016      3
12     B  2015      3
13     B  2014      3
14     B  2013      3

explode

df.assign(year=[range(y, y - 5, -1) for y in df.year]).explode('year')

  grade  year  label
0     A  2018      1
0     A  2017      1
0     A  2016      1
0     A  2015      1
0     A  2014      1
1     C  2015      2
1     C  2014      2
1     C  2013      2
1     C  2012      2
1     C  2011      2
2     B  2017      3
2     B  2016      3
2     B  2015      3
2     B  2014      3
2     B  2013      3

Upvotes: 3

Quang Hoang
Quang Hoang

Reputation: 150745

You can try:

(pd.concat(df.assign(year=df['year'].sub(i)) for i in range(5))
   .sort_index()
   .reset_index(drop=True)
)

Output:

   grade  year  label
0      A  2018      1
1      A  2017      1
2      A  2016      1
3      A  2015      1
4      A  2014      1
5      C  2015      2
6      C  2014      2
7      C  2013      2
8      C  2012      2
9      C  2011      2
10     B  2017      3
11     B  2016      3
12     B  2015      3
13     B  2014      3
14     B  2013      3

Upvotes: 4

Related Questions