I have this pandas dataset: date a b c d col1 col2 01-12-20 0 0 0 0 1 0.2 02-12-20 0 0 0 0 2 0.4 03-12-20 0 0 0 0 5 5 03-12-20 1 1 0 0 6 1 03-12-20 0 1 0 0 5 1 05-12-20 0 0 0 0 2 0.3 07-12-20 1 0 1 0 3 6 08-12-20 0 1 0 1 8 7 08-12-20 0 0 1 0 7 1 10-12-20 0 0 0 0 4 1 11-12-20 0 0 0 1 3 0 and the following list of time intervals from datetime import date timeintervals = [date(2020, 12, 1), date(2020, 12, 3), date(2020, 12, 7), date(2020, 12, 11), date(2020, 12, 21)] groupcol = ["a","b","c","d"] I need to obtain a new dataset where: I get the average of col1 and col2, grouped by time interval and by the columns listed in the groupcol variable (considering only rows with 1 and not with zeros). It is like saying the average of "a" for each time interval, when "a" is 1. If "a" has all zeros in an interval, the resulting average value should be zero. name of column "abcd" can be any name; to define time intervals we consider the provided list such that timeintervals[i] <= date < timeintervals[i+1] and proceed recursively; time intervals in the grouped df are labeled by their upperbound minus one day. The result should be this: date abcd col1 col2 02-12-20 a 0 0 02-12-20 b 0 0 02-12-20 c 0 0 02-12-20 d 0 0 06-12-20 a 6 1 06-12-20 b 5.5 1 06-12-20 c 0 0 06-12-20 d 0 0 10-12-20 a 3 6 10-12-20 b 8 7 10-12-20 c 5 3.5 10-12-20 d 8 7 20-12-20 a 0 0 20-12-20 b 0 0 20-12-20 c 0 0 20-12-20 d 3 0 I am having trouble finding an efficient solution for this.

Reputation: 453

Pandas average by varying time intervals and group by columns

I have this pandas dataset:

date	a	b	c	d	col1	col2
01-12-20	0	0	0	0	1	0.2
02-12-20	0	0	0	0	2	0.4
03-12-20	0	0	0	0	5	5
03-12-20	1	1	0	0	6	1
03-12-20	0	1	0	0	5	1
05-12-20	0	0	0	0	2	0.3
07-12-20	1	0	1	0	3	6
08-12-20	0	1	0	1	8	7
08-12-20	0	0	1	0	7	1
10-12-20	0	0	0	0	4	1
11-12-20	0	0	0	1	3	0

and the following list of time intervals

from datetime import date
timeintervals = [date(2020, 12, 1), date(2020, 12, 3), date(2020, 12, 7), date(2020, 12, 11), date(2020, 12, 21)]
groupcol = ["a","b","c","d"]

I need to obtain a new dataset where:

I get the average of col1 and col2, grouped by time interval and by the columns listed in the groupcol variable (considering only rows with 1 and not with zeros). It is like saying the average of "a" for each time interval, when "a" is 1. If "a" has all zeros in an interval, the resulting average value should be zero.
name of column "abcd" can be any name;
to define time intervals we consider the provided list such that timeintervals[i] <= date < timeintervals[i+1] and proceed recursively;
time intervals in the grouped df are labeled by their upperbound minus one day.

The result should be this:

date	abcd	col1	col2
02-12-20	a	0	0
02-12-20	b	0	0
02-12-20	c	0	0
02-12-20	d	0	0
06-12-20	a	6	1
06-12-20	b	5.5	1
06-12-20	c	0	0
06-12-20	d	0	0
10-12-20	a	3	6
10-12-20	b	8	7
10-12-20	c	5	3.5
10-12-20	d	8	7
20-12-20	a	0	0
20-12-20	b	0	0
20-12-20	c	0	0
20-12-20	d	3	0

I am having trouble finding an efficient solution for this.

Upvotes: 1

Answers (1)

piterbarg

Reputation: 8219

Let us do it in steps

First we add a column that specifies which group we allocate each date to

df['dgr'] = df['date'].apply(lambda d: [d.date()<l for l in timeintervals].index(True)-1)
df

to obtain

    date                   a    b    c    d    col1    col2    dgr
--  -------------------  ---  ---  ---  ---  ------  ------  -----
 0  2020-12-01 00:00:00    0    0    0    0       1     0.2      0
 1  2020-12-02 00:00:00    0    0    0    0       2     0.4      0
 2  2020-12-03 00:00:00    0    0    0    0       5     5        1
 3  2020-12-03 00:00:00    1    1    0    0       6     1        1
 4  2020-12-03 00:00:00    0    1    0    0       5     1        1
 5  2020-12-05 00:00:00    0    0    0    0       2     0.3      1
 6  2020-12-07 00:00:00    1    0    1    0       3     6        2
 7  2020-12-08 00:00:00    0    1    0    1       8     7        2
 8  2020-12-08 00:00:00    0    0    1    0       7     1        2
 9  2020-12-10 00:00:00    0    0    0    0       4     1        2
10  2020-12-11 00:00:00    0    0    0    1       3     0        3

The following command prepares the 'skeleton' of the final result -- basically unrolls a,b,c,d into a column and keeps tracks of 0,1s in the original df

df2 = (df[groupcol]
    .stack()
    .reset_index(level = 1)
    .rename(columns = {'level_1' : 'abcd',0:'incl'})
)
df2

to get

    abcd      incl
--  ------  ------
 0  a            0
 0  b            0
 0  c            0
 0  d            0
 1  a            0
 1  b            0
 1  c            0
 1  d            0
 2  a            0
 2  b            0
 2  c            0
 2  d            0
 3  a            1
 3  b            1
 3  c            0
 3  d            0
 4  a            0
 4  b            1
 4  c            0
 4  d            0
 5  a            0
 5  b            0
 5  c            0
 5  d            0
 6  a            1
 6  b            0
 6  c            1
 6  d            0
 7  a            0
 7  b            1
 7  c            0
 7  d            1
 8  a            0
 8  b            0
 8  c            1
 8  d            0
 9  a            0
 9  b            0
 9  c            0
 9  d            0
10  a            0
10  b            0
10  c            0
10  d            1

Now we can join df2 with df to fill in the date,col1, col2 details

df2 = df2.join(df[['date','col1','col2','dgr']])

to get

    abcd      incl  date                   col1    col2    dgr
--  ------  ------  -------------------  ------  ------  -----
 0  a            0  2020-12-01 00:00:00       1     0.2      0
 0  b            0  2020-12-01 00:00:00       1     0.2      0
 0  c            0  2020-12-01 00:00:00       1     0.2      0
 0  d            0  2020-12-01 00:00:00       1     0.2      0
 1  a            0  2020-12-02 00:00:00       2     0.4      0
 1  b            0  2020-12-02 00:00:00       2     0.4      0
 1  c            0  2020-12-02 00:00:00       2     0.4      0
 1  d            0  2020-12-02 00:00:00       2     0.4      0
 2  a            0  2020-12-03 00:00:00       5     5        1
 2  b            0  2020-12-03 00:00:00       5     5        1
 2  c            0  2020-12-03 00:00:00       5     5        1
 2  d            0  2020-12-03 00:00:00       5     5        1
 3  a            1  2020-12-03 00:00:00       6     1        1
 3  b            1  2020-12-03 00:00:00       6     1        1
 3  c            0  2020-12-03 00:00:00       6     1        1
 3  d            0  2020-12-03 00:00:00       6     1        1
 4  a            0  2020-12-03 00:00:00       5     1        1
 4  b            1  2020-12-03 00:00:00       5     1        1
 4  c            0  2020-12-03 00:00:00       5     1        1
 4  d            0  2020-12-03 00:00:00       5     1        1
 5  a            0  2020-12-05 00:00:00       2     0.3      1
 5  b            0  2020-12-05 00:00:00       2     0.3      1
 5  c            0  2020-12-05 00:00:00       2     0.3      1
 5  d            0  2020-12-05 00:00:00       2     0.3      1
 6  a            1  2020-12-07 00:00:00       3     6        2
 6  b            0  2020-12-07 00:00:00       3     6        2
 6  c            1  2020-12-07 00:00:00       3     6        2
 6  d            0  2020-12-07 00:00:00       3     6        2
 7  a            0  2020-12-08 00:00:00       8     7        2
 7  b            1  2020-12-08 00:00:00       8     7        2
 7  c            0  2020-12-08 00:00:00       8     7        2
 7  d            1  2020-12-08 00:00:00       8     7        2
 8  a            0  2020-12-08 00:00:00       7     1        2
 8  b            0  2020-12-08 00:00:00       7     1        2
 8  c            1  2020-12-08 00:00:00       7     1        2
 8  d            0  2020-12-08 00:00:00       7     1        2
 9  a            0  2020-12-10 00:00:00       4     1        2
 9  b            0  2020-12-10 00:00:00       4     1        2
 9  c            0  2020-12-10 00:00:00       4     1        2
 9  d            0  2020-12-10 00:00:00       4     1        2
10  a            0  2020-12-11 00:00:00       3     0        3
10  b            0  2020-12-11 00:00:00       3     0        3
10  c            0  2020-12-11 00:00:00       3     0        3
10  d            1  2020-12-11 00:00:00       3     0        3

Now we are ready to aggregate by dgr, abcd and label as required

df2[['col1','col2']] = df2[['col1','col2']].mul(df2['incl'],axis=0)
df3 = df2.groupby(['dgr','abcd']).agg(sum).reset_index()
df3[['col1','col2']] = df3[['col1','col2']].div(df3['incl'],axis=0).fillna(0)
df3['date'] = df3['dgr'].map(lambda g: timeintervals[g+1] + timedelta(days = -1))
df3.drop(columns = ['incl','dgr'])

for the final output

    abcd      col1    col2  date
--  ------  ------  ------  ----------
 0  a          0       0    2020-12-02
 1  b          0       0    2020-12-02
 2  c          0       0    2020-12-02
 3  d          0       0    2020-12-02
 4  a          6       1    2020-12-06
 5  b          5.5     1    2020-12-06
 6  c          0       0    2020-12-06
 7  d          0       0    2020-12-06
 8  a          3       6    2020-12-10
 9  b          8       7    2020-12-10
10  c          5       3.5  2020-12-10
11  d          8       7    2020-12-10
12  a          0       0    2020-12-20
13  b          0       0    2020-12-20
14  c          0       0    2020-12-20
15  d          3       0    2020-12-20

Upvotes: 1

Pandas average by varying time intervals and group by columns

Answers (1)

Related Questions