Reputation: 453
I have this pandas dataset:
date | a | b | c | d | col1 | col2 |
---|---|---|---|---|---|---|
01-12-20 | 0 | 0 | 0 | 0 | 1 | 0.2 |
02-12-20 | 0 | 0 | 0 | 0 | 2 | 0.4 |
03-12-20 | 0 | 0 | 0 | 0 | 5 | 5 |
03-12-20 | 1 | 1 | 0 | 0 | 6 | 1 |
03-12-20 | 0 | 1 | 0 | 0 | 5 | 1 |
05-12-20 | 0 | 0 | 0 | 0 | 2 | 0.3 |
07-12-20 | 1 | 0 | 1 | 0 | 3 | 6 |
08-12-20 | 0 | 1 | 0 | 1 | 8 | 7 |
08-12-20 | 0 | 0 | 1 | 0 | 7 | 1 |
10-12-20 | 0 | 0 | 0 | 0 | 4 | 1 |
11-12-20 | 0 | 0 | 0 | 1 | 3 | 0 |
and the following list of time intervals
from datetime import date
timeintervals = [date(2020, 12, 1), date(2020, 12, 3), date(2020, 12, 7), date(2020, 12, 11), date(2020, 12, 21)]
groupcol = ["a","b","c","d"]
I need to obtain a new dataset where:
timeintervals[i] <= date < timeintervals[i+1]
and proceed recursively;The result should be this:
date | abcd | col1 | col2 |
---|---|---|---|
02-12-20 | a | 0 | 0 |
02-12-20 | b | 0 | 0 |
02-12-20 | c | 0 | 0 |
02-12-20 | d | 0 | 0 |
06-12-20 | a | 6 | 1 |
06-12-20 | b | 5.5 | 1 |
06-12-20 | c | 0 | 0 |
06-12-20 | d | 0 | 0 |
10-12-20 | a | 3 | 6 |
10-12-20 | b | 8 | 7 |
10-12-20 | c | 5 | 3.5 |
10-12-20 | d | 8 | 7 |
20-12-20 | a | 0 | 0 |
20-12-20 | b | 0 | 0 |
20-12-20 | c | 0 | 0 |
20-12-20 | d | 3 | 0 |
I am having trouble finding an efficient solution for this.
Upvotes: 1
Views: 567
Reputation: 8219
Let us do it in steps
First we add a column that specifies which group we allocate each date to
df['dgr'] = df['date'].apply(lambda d: [d.date()<l for l in timeintervals].index(True)-1)
df
to obtain
date a b c d col1 col2 dgr
-- ------------------- --- --- --- --- ------ ------ -----
0 2020-12-01 00:00:00 0 0 0 0 1 0.2 0
1 2020-12-02 00:00:00 0 0 0 0 2 0.4 0
2 2020-12-03 00:00:00 0 0 0 0 5 5 1
3 2020-12-03 00:00:00 1 1 0 0 6 1 1
4 2020-12-03 00:00:00 0 1 0 0 5 1 1
5 2020-12-05 00:00:00 0 0 0 0 2 0.3 1
6 2020-12-07 00:00:00 1 0 1 0 3 6 2
7 2020-12-08 00:00:00 0 1 0 1 8 7 2
8 2020-12-08 00:00:00 0 0 1 0 7 1 2
9 2020-12-10 00:00:00 0 0 0 0 4 1 2
10 2020-12-11 00:00:00 0 0 0 1 3 0 3
The following command prepares the 'skeleton' of the final result -- basically unrolls a,b,c,d into a column and keeps tracks of 0,1s in the original df
df2 = (df[groupcol]
.stack()
.reset_index(level = 1)
.rename(columns = {'level_1' : 'abcd',0:'incl'})
)
df2
to get
abcd incl
-- ------ ------
0 a 0
0 b 0
0 c 0
0 d 0
1 a 0
1 b 0
1 c 0
1 d 0
2 a 0
2 b 0
2 c 0
2 d 0
3 a 1
3 b 1
3 c 0
3 d 0
4 a 0
4 b 1
4 c 0
4 d 0
5 a 0
5 b 0
5 c 0
5 d 0
6 a 1
6 b 0
6 c 1
6 d 0
7 a 0
7 b 1
7 c 0
7 d 1
8 a 0
8 b 0
8 c 1
8 d 0
9 a 0
9 b 0
9 c 0
9 d 0
10 a 0
10 b 0
10 c 0
10 d 1
Now we can join df2 with df to fill in the date,col1, col2 details
df2 = df2.join(df[['date','col1','col2','dgr']])
to get
abcd incl date col1 col2 dgr
-- ------ ------ ------------------- ------ ------ -----
0 a 0 2020-12-01 00:00:00 1 0.2 0
0 b 0 2020-12-01 00:00:00 1 0.2 0
0 c 0 2020-12-01 00:00:00 1 0.2 0
0 d 0 2020-12-01 00:00:00 1 0.2 0
1 a 0 2020-12-02 00:00:00 2 0.4 0
1 b 0 2020-12-02 00:00:00 2 0.4 0
1 c 0 2020-12-02 00:00:00 2 0.4 0
1 d 0 2020-12-02 00:00:00 2 0.4 0
2 a 0 2020-12-03 00:00:00 5 5 1
2 b 0 2020-12-03 00:00:00 5 5 1
2 c 0 2020-12-03 00:00:00 5 5 1
2 d 0 2020-12-03 00:00:00 5 5 1
3 a 1 2020-12-03 00:00:00 6 1 1
3 b 1 2020-12-03 00:00:00 6 1 1
3 c 0 2020-12-03 00:00:00 6 1 1
3 d 0 2020-12-03 00:00:00 6 1 1
4 a 0 2020-12-03 00:00:00 5 1 1
4 b 1 2020-12-03 00:00:00 5 1 1
4 c 0 2020-12-03 00:00:00 5 1 1
4 d 0 2020-12-03 00:00:00 5 1 1
5 a 0 2020-12-05 00:00:00 2 0.3 1
5 b 0 2020-12-05 00:00:00 2 0.3 1
5 c 0 2020-12-05 00:00:00 2 0.3 1
5 d 0 2020-12-05 00:00:00 2 0.3 1
6 a 1 2020-12-07 00:00:00 3 6 2
6 b 0 2020-12-07 00:00:00 3 6 2
6 c 1 2020-12-07 00:00:00 3 6 2
6 d 0 2020-12-07 00:00:00 3 6 2
7 a 0 2020-12-08 00:00:00 8 7 2
7 b 1 2020-12-08 00:00:00 8 7 2
7 c 0 2020-12-08 00:00:00 8 7 2
7 d 1 2020-12-08 00:00:00 8 7 2
8 a 0 2020-12-08 00:00:00 7 1 2
8 b 0 2020-12-08 00:00:00 7 1 2
8 c 1 2020-12-08 00:00:00 7 1 2
8 d 0 2020-12-08 00:00:00 7 1 2
9 a 0 2020-12-10 00:00:00 4 1 2
9 b 0 2020-12-10 00:00:00 4 1 2
9 c 0 2020-12-10 00:00:00 4 1 2
9 d 0 2020-12-10 00:00:00 4 1 2
10 a 0 2020-12-11 00:00:00 3 0 3
10 b 0 2020-12-11 00:00:00 3 0 3
10 c 0 2020-12-11 00:00:00 3 0 3
10 d 1 2020-12-11 00:00:00 3 0 3
Now we are ready to aggregate by dgr, abcd and label as required
df2[['col1','col2']] = df2[['col1','col2']].mul(df2['incl'],axis=0)
df3 = df2.groupby(['dgr','abcd']).agg(sum).reset_index()
df3[['col1','col2']] = df3[['col1','col2']].div(df3['incl'],axis=0).fillna(0)
df3['date'] = df3['dgr'].map(lambda g: timeintervals[g+1] + timedelta(days = -1))
df3.drop(columns = ['incl','dgr'])
for the final output
abcd col1 col2 date
-- ------ ------ ------ ----------
0 a 0 0 2020-12-02
1 b 0 0 2020-12-02
2 c 0 0 2020-12-02
3 d 0 0 2020-12-02
4 a 6 1 2020-12-06
5 b 5.5 1 2020-12-06
6 c 0 0 2020-12-06
7 d 0 0 2020-12-06
8 a 3 6 2020-12-10
9 b 8 7 2020-12-10
10 c 5 3.5 2020-12-10
11 d 8 7 2020-12-10
12 a 0 0 2020-12-20
13 b 0 0 2020-12-20
14 c 0 0 2020-12-20
15 d 3 0 2020-12-20
Upvotes: 1