Abhishek Thakur
Abhishek Thakur

Reputation: 16995

add rows in pandas dataframe based on date and value

I have a pandas dataframe like the following:

id, date, add_days
1, 2017-01-01, 3
2, 2017-03-05, 5
3, 2017-02-27, 3
.
.
.

I want to repeat the ids and increase date by given add_days:

id, date, add_days
1, 2017-01-01, 3
1, 2017-01-02, 3
1, 2017-01-03, 3
2, 2017-03-05, 5
2, 2017-03-06, 5
2, 2017-03-07, 5
2, 2017-03-08, 5
2, 2017-03-09, 5
3, 2017-02-27, 3
3, 2017-02-28, 3
3, 2017-03-01, 3
.
.
.

Is there a panda-oic way of doing this? I'm looking for an efficient solution since the initial dataframe can have millions of rows.

Upvotes: 3

Views: 1669

Answers (1)

jezrael
jezrael

Reputation: 862431

You can use melt with groupby and resample:

cols = df.columns

#add end date by timedelta, only substract one day
df['end'] = df.date + pd.to_timedelta(df.add_days.sub(1), unit='d')
print (df)
   id       date  add_days        end
0   1 2017-01-01         3 2017-01-03
1   2 2017-03-05         5 2017-03-09
2   3 2017-02-27         3 2017-03-01

df1 = pd.melt(
    df, ['id', 'add_days'],
    ['date', 'end'],
    value_name='date'
).drop('variable', 1).set_index('date')
print (df1)
            id  add_days
date                    
2017-01-01   1         3
2017-03-05   2         5
2017-02-27   3         3
2017-01-03   1         3
2017-03-09   2         5
2017-03-01   3         3
df2=df1.groupby('id').resample('D').ffill().reset_index(0, drop=True).reset_index()
#if order of columns is important
df2 = df2.reindex_axis(cols, axis=1)
print (df2)
    id       date  add_days
0    1 2017-01-01         3
1    1 2017-01-02         3
2    1 2017-01-03         3
3    2 2017-03-05         5
4    2 2017-03-06         5
5    2 2017-03-07         5
6    2 2017-03-08         5
7    2 2017-03-09         5
8    3 2017-02-27         3
9    3 2017-02-28         3
10   3 2017-03-01         3

Another solution with concat Series created by date_range and last join to original df:

df1 = pd.concat([pd.Series(r.Index, pd.date_range(r.date, r.end)) 
                                       for r in df.itertuples()]).reset_index()
df1.columns = ['date','idx']
print (df1)
         date  idx
0  2017-01-01    0
1  2017-01-02    0
2  2017-01-03    0
3  2017-03-05    1
4  2017-03-06    1
5  2017-03-07    1
6  2017-03-08    1
7  2017-03-09    1
8  2017-02-27    2
9  2017-02-28    2
10 2017-03-01    2

df2 = df1.set_index('idx').join(df[['id','add_days']]).reset_index(drop=True)
print (df2)
         date  id  add_days
0  2017-01-01   1         3
1  2017-01-02   1         3
2  2017-01-03   1         3
3  2017-03-05   2         5
4  2017-03-06   2         5
5  2017-03-07   2         5
6  2017-03-08   2         5
7  2017-03-09   2         5
8  2017-02-27   3         3
9  2017-02-28   3         3
10 2017-03-01   3         3

Upvotes: 3

Related Questions