Reputation: 1601
I have the following dataframe:
name from amount days
A 7/31/18 200 1
B 7/31/18 300 1
C 7/30/18 200 1
D 7/27/18 100 3
......
G 7/17/18 50 1
H 7/13/18 150 4
I'd like to expand it to this, where days does not equal 1:
name from amount days
A 7/31/18 200 1
B 7/31/18 300 1
C 7/30/18 200 1
D 7/29/18 100 3
D 7/28/18 100 3
D 7/27/18 100 3
......
G 7/17/18 50 1
H 7/16/18 150 4
H 7/15/18 150 4
H 7/14/18 150 4
H 7/13/18 150 4
If possible, I'd also like to add a column that can distinguish between the original data and expanded data (since I'm going to need to filter some dates eventually):
name from amount days original
A 7/31/18 200 1 1
B 7/31/18 300 1 1
C 7/30/18 200 1 1
D 7/29/18 100 3 0
D 7/28/18 100 3 0
D 7/27/18 100 3 1
......
G 7/17/18 50 1 1
H 7/16/18 150 4 0
H 7/15/18 150 4 0
H 7/14/18 150 4 0
H 7/13/18 150 4 1
Edit: To clarify the expansion: Days will tell you how many rows it needs to be expanded to. Alternatively, you can use the date above the original value as a boundary (the entry 7/27 where days=3 will stop at the above value where the date is 7/30. The data has constraints to make sure it never overlaps).
Upvotes: 2
Views: 511
Reputation: 294218
df['from'] = pd.to_datetime(df['from'])
pd.DataFrame([
(n, f, a, d, int(f == F))
for n, F, a, d in zip(*map(df.get, df))
for f in pd.date_range(F, periods=d)[::-1]
], columns=[*df.columns] + ['original'])
name from amount days original
0 A 2018-07-31 200 1 1
1 B 2018-07-31 300 1 1
2 C 2018-07-30 200 1 1
3 D 2018-07-29 100 3 0
4 D 2018-07-28 100 3 0
5 D 2018-07-27 100 3 1
6 G 2018-07-17 50 1 1
7 H 2018-07-16 150 4 0
8 H 2018-07-15 150 4 0
9 H 2018-07-14 150 4 0
10 H 2018-07-13 150 4 1
I edited my answer to use duplicated
instead of cum_count
. I got the idea from @Wen's post
def f(x):
return pd.date_range(
pd.to_datetime(x).min(),
periods=len(x)
).sort_values(ascending=False)
def g(d):
return d.groupby('name')['from'].transform(f)
def h(d):
return 1 - d.name.duplicated(keep='last')
df.loc[df.index.repeat(df.days)].assign(**{'from': g, 'original': h})
name from amount days original
0 A 2018-07-31 200 1 1
1 B 2018-07-31 300 1 1
2 C 2018-07-30 200 1 1
3 D 2018-07-29 100 3 0
3 D 2018-07-28 100 3 0
3 D 2018-07-27 100 3 1
4 G 2018-07-17 50 1 1
5 H 2018-07-16 150 4 0
5 H 2018-07-15 150 4 0
5 H 2018-07-14 150 4 0
5 H 2018-07-13 150 4 1
Upvotes: 4
Reputation: 323226
About two steps create the dataframe(reindex
) , and adjust the values (duplicated
)
newdf=df.reindex(df.index.repeat(df.days)) # create the df using reindex
adddate=pd.Series(np.concatenate(df.days.apply(np.arange).values),index=newdf.index)# create the timedelta to add
newdf['from']=pd.to_datetime(newdf['from'])+pd.to_timedelta(adddate,unit='d')# assign the value
newdf['original']=(~newdf.index.duplicated()).astype(int)
newdf
Out[240]:
name from amount days original
0 A 2018-07-31 200 1 1
1 B 2018-07-31 300 1 1
2 C 2018-07-30 200 1 1
3 D 2018-07-27 100 3 1
3 D 2018-07-28 100 3 0
3 D 2018-07-29 100 3 0
Upvotes: 4