Forinstance
Forinstance

Reputation: 453

Pandas average by varying time intervals and group by columns

I have this pandas dataset:

date a b c d col1 col2
01-12-20 0 0 0 0 1 0.2
02-12-20 0 0 0 0 2 0.4
03-12-20 0 0 0 0 5 5
03-12-20 1 1 0 0 6 1
03-12-20 0 1 0 0 5 1
05-12-20 0 0 0 0 2 0.3
07-12-20 1 0 1 0 3 6
08-12-20 0 1 0 1 8 7
08-12-20 0 0 1 0 7 1
10-12-20 0 0 0 0 4 1
11-12-20 0 0 0 1 3 0

and the following list of time intervals

from datetime import date
timeintervals = [date(2020, 12, 1), date(2020, 12, 3), date(2020, 12, 7), date(2020, 12, 11), date(2020, 12, 21)]
groupcol = ["a","b","c","d"]

I need to obtain a new dataset where:

The result should be this:

date abcd col1 col2
02-12-20 a 0 0
02-12-20 b 0 0
02-12-20 c 0 0
02-12-20 d 0 0
06-12-20 a 6 1
06-12-20 b 5.5 1
06-12-20 c 0 0
06-12-20 d 0 0
10-12-20 a 3 6
10-12-20 b 8 7
10-12-20 c 5 3.5
10-12-20 d 8 7
20-12-20 a 0 0
20-12-20 b 0 0
20-12-20 c 0 0
20-12-20 d 3 0

I am having trouble finding an efficient solution for this.

Upvotes: 1

Views: 567

Answers (1)

piterbarg
piterbarg

Reputation: 8219

Let us do it in steps

First we add a column that specifies which group we allocate each date to

df['dgr'] = df['date'].apply(lambda d: [d.date()<l for l in timeintervals].index(True)-1)
df

to obtain

    date                   a    b    c    d    col1    col2    dgr
--  -------------------  ---  ---  ---  ---  ------  ------  -----
 0  2020-12-01 00:00:00    0    0    0    0       1     0.2      0
 1  2020-12-02 00:00:00    0    0    0    0       2     0.4      0
 2  2020-12-03 00:00:00    0    0    0    0       5     5        1
 3  2020-12-03 00:00:00    1    1    0    0       6     1        1
 4  2020-12-03 00:00:00    0    1    0    0       5     1        1
 5  2020-12-05 00:00:00    0    0    0    0       2     0.3      1
 6  2020-12-07 00:00:00    1    0    1    0       3     6        2
 7  2020-12-08 00:00:00    0    1    0    1       8     7        2
 8  2020-12-08 00:00:00    0    0    1    0       7     1        2
 9  2020-12-10 00:00:00    0    0    0    0       4     1        2
10  2020-12-11 00:00:00    0    0    0    1       3     0        3

The following command prepares the 'skeleton' of the final result -- basically unrolls a,b,c,d into a column and keeps tracks of 0,1s in the original df

df2 = (df[groupcol]
    .stack()
    .reset_index(level = 1)
    .rename(columns = {'level_1' : 'abcd',0:'incl'})
)
df2

to get

    abcd      incl
--  ------  ------
 0  a            0
 0  b            0
 0  c            0
 0  d            0
 1  a            0
 1  b            0
 1  c            0
 1  d            0
 2  a            0
 2  b            0
 2  c            0
 2  d            0
 3  a            1
 3  b            1
 3  c            0
 3  d            0
 4  a            0
 4  b            1
 4  c            0
 4  d            0
 5  a            0
 5  b            0
 5  c            0
 5  d            0
 6  a            1
 6  b            0
 6  c            1
 6  d            0
 7  a            0
 7  b            1
 7  c            0
 7  d            1
 8  a            0
 8  b            0
 8  c            1
 8  d            0
 9  a            0
 9  b            0
 9  c            0
 9  d            0
10  a            0
10  b            0
10  c            0
10  d            1

Now we can join df2 with df to fill in the date,col1, col2 details

df2 = df2.join(df[['date','col1','col2','dgr']])

to get

    abcd      incl  date                   col1    col2    dgr
--  ------  ------  -------------------  ------  ------  -----
 0  a            0  2020-12-01 00:00:00       1     0.2      0
 0  b            0  2020-12-01 00:00:00       1     0.2      0
 0  c            0  2020-12-01 00:00:00       1     0.2      0
 0  d            0  2020-12-01 00:00:00       1     0.2      0
 1  a            0  2020-12-02 00:00:00       2     0.4      0
 1  b            0  2020-12-02 00:00:00       2     0.4      0
 1  c            0  2020-12-02 00:00:00       2     0.4      0
 1  d            0  2020-12-02 00:00:00       2     0.4      0
 2  a            0  2020-12-03 00:00:00       5     5        1
 2  b            0  2020-12-03 00:00:00       5     5        1
 2  c            0  2020-12-03 00:00:00       5     5        1
 2  d            0  2020-12-03 00:00:00       5     5        1
 3  a            1  2020-12-03 00:00:00       6     1        1
 3  b            1  2020-12-03 00:00:00       6     1        1
 3  c            0  2020-12-03 00:00:00       6     1        1
 3  d            0  2020-12-03 00:00:00       6     1        1
 4  a            0  2020-12-03 00:00:00       5     1        1
 4  b            1  2020-12-03 00:00:00       5     1        1
 4  c            0  2020-12-03 00:00:00       5     1        1
 4  d            0  2020-12-03 00:00:00       5     1        1
 5  a            0  2020-12-05 00:00:00       2     0.3      1
 5  b            0  2020-12-05 00:00:00       2     0.3      1
 5  c            0  2020-12-05 00:00:00       2     0.3      1
 5  d            0  2020-12-05 00:00:00       2     0.3      1
 6  a            1  2020-12-07 00:00:00       3     6        2
 6  b            0  2020-12-07 00:00:00       3     6        2
 6  c            1  2020-12-07 00:00:00       3     6        2
 6  d            0  2020-12-07 00:00:00       3     6        2
 7  a            0  2020-12-08 00:00:00       8     7        2
 7  b            1  2020-12-08 00:00:00       8     7        2
 7  c            0  2020-12-08 00:00:00       8     7        2
 7  d            1  2020-12-08 00:00:00       8     7        2
 8  a            0  2020-12-08 00:00:00       7     1        2
 8  b            0  2020-12-08 00:00:00       7     1        2
 8  c            1  2020-12-08 00:00:00       7     1        2
 8  d            0  2020-12-08 00:00:00       7     1        2
 9  a            0  2020-12-10 00:00:00       4     1        2
 9  b            0  2020-12-10 00:00:00       4     1        2
 9  c            0  2020-12-10 00:00:00       4     1        2
 9  d            0  2020-12-10 00:00:00       4     1        2
10  a            0  2020-12-11 00:00:00       3     0        3
10  b            0  2020-12-11 00:00:00       3     0        3
10  c            0  2020-12-11 00:00:00       3     0        3
10  d            1  2020-12-11 00:00:00       3     0        3

Now we are ready to aggregate by dgr, abcd and label as required

df2[['col1','col2']] = df2[['col1','col2']].mul(df2['incl'],axis=0)
df3 = df2.groupby(['dgr','abcd']).agg(sum).reset_index()
df3[['col1','col2']] = df3[['col1','col2']].div(df3['incl'],axis=0).fillna(0)
df3['date'] = df3['dgr'].map(lambda g: timeintervals[g+1] + timedelta(days = -1))
df3.drop(columns = ['incl','dgr'])

for the final output

    abcd      col1    col2  date
--  ------  ------  ------  ----------
 0  a          0       0    2020-12-02
 1  b          0       0    2020-12-02
 2  c          0       0    2020-12-02
 3  d          0       0    2020-12-02
 4  a          6       1    2020-12-06
 5  b          5.5     1    2020-12-06
 6  c          0       0    2020-12-06
 7  d          0       0    2020-12-06
 8  a          3       6    2020-12-10
 9  b          8       7    2020-12-10
10  c          5       3.5  2020-12-10
11  d          8       7    2020-12-10
12  a          0       0    2020-12-20
13  b          0       0    2020-12-20
14  c          0       0    2020-12-20
15  d          3       0    2020-12-20

Upvotes: 1

Related Questions