Sjoseph

Reputation: 873

Improve performance of groupby with a custom apply function

I have sales data of the form:

ID,year,sales,bonus
1,2015,1000,500
1,2015,1400,590
1,2016,1100,200
1,2017,1200,800
1,2017,1700,300
1,2017,1900,510
1,2018,2000,560
1,2018,1700,600
2,2015,2000,400
2,2015,1450,580
2,2015,2100,300
2,2017,1400,770
2,2017,2700,330
2,2018,3900,610
2,2018,2000,530
2,2018,1700,700
3,2015,2900,406
3,2015,1450,580
3,2015,2100,300
3,2017,1450,777
3,2018,3100,330
3,2018,3900,610
3,2019,2000,530
3,2019,1900,730

I want to add two new columns for each of the sales and bonus columns, capturing the 10th and 90th percentiles for each ID and year (I appreciate this doesn't make much sense in this simplified dataset).

I have used the following approach:

import pandas as pd

def aggs(df, names):
    # compute the 90th and 10th percentiles of each named column for the group
    for name in names:
        df[name + '90'] = df[name].quantile(.90)
        df[name + '10'] = df[name].quantile(.10)
    return df

sales = pd.read_csv("sales.csv")
names = ['sales', 'bonus']
sales2 = sales.groupby(['year', 'ID'], as_index=False)
sales2 = sales2.apply(aggs, names)

To produce

    ID  year    sales   bonus   sales90 sales10 bonus90 bonus10
0   1   2015    1000    500 1360.0  1040.0  581.0   509.0
1   1   2015    1400    590 1360.0  1040.0  581.0   509.0
2   1   2016    1100    200 1100.0  1100.0  200.0   200.0
3   1   2017    1200    800 1860.0  1300.0  742.0   342.0
4   1   2017    1700    300 1860.0  1300.0  742.0   342.0
5   1   2017    1900    510 1860.0  1300.0  742.0   342.0
6   1   2018    2000    560 1970.0  1730.0  596.0   564.0
7   1   2018    1700    600 1970.0  1730.0  596.0   564.0
8   2   2015    2000    400 2080.0  1560.0  544.0   320.0
9   2   2015    1450    580 2080.0  1560.0  544.0   320.0
10  2   2015    2100    300 2080.0  1560.0  544.0   320.0
11  2   2017    1400    770 2570.0  1530.0  726.0   374.0
12  2   2017    2700    330 2570.0  1530.0  726.0   374.0
13  2   2018    3900    610 3520.0  1760.0  682.0   546.0
14  2   2018    2000    530 3520.0  1760.0  682.0   546.0
15  2   2018    1700    700 3520.0  1760.0  682.0   546.0
16  3   2015    2900    406 2740.0  1580.0  545.2   321.2
17  3   2015    1450    580 2740.0  1580.0  545.2   321.2
18  3   2015    2100    300 2740.0  1580.0  545.2   321.2
19  3   2017    1450    777 1450.0  1450.0  777.0   777.0
20  3   2018    3100    330 3820.0  3180.0  582.0   358.0
21  3   2018    3900    610 3820.0  3180.0  582.0   358.0
22  3   2019    2000    530 1990.0  1910.0  710.0   550.0
23  3   2019    1900    730 1990.0  1910.0  710.0   550.0

Problem

The code works as desired but as the number of rows and columns increases, there are big performance issues. I also get the warning:

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
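For reference, the warning itself points at a fix for the fragmentation part: build the new columns first and attach them in a single pd.concat(axis=1) instead of inserting them one at a time. A minimal sketch of that variant (aggs_concat is just an illustrative name, and this alone does not remove the per-group apply overhead):

def aggs_concat(df, names):
    # compute the group's percentiles once, then attach all new columns in a
    # single concat rather than one frame.insert per column
    new_cols = {
        f'{name}{q}': pd.Series(df[name].quantile(q / 100), index=df.index)
        for name in names
        for q in (90, 10)
    }
    return pd.concat([df, pd.DataFrame(new_cols)], axis=1)

sales2 = sales.groupby(['year', 'ID'], as_index=False).apply(aggs_concat, names)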

Can anyone advise how I could improve this approach? Thanks!

Upvotes: 1

Views: 152

Answers (1)

aneroid

Reputation: 15962

  1. The df.quantile() method can be passed a collection of quantiles to compute, and by default it computes them along the index.

    • Each quantile gets added as a level on the index. So put your groupby terms (year, ID) into the index, and then unstack that quantile level to turn it into a multi-level column.

  2. Change the columns from multi-level to single-level and rename them.

  3. merge() merges easily when the frames have columns in common, so use that with qs.reset_index() to get year and ID back as columns.

qs = sales.groupby(['year', 'ID']).quantile([.90, .10]).unstack()
qs.columns = [f'{c}{int(n*100)}' for c, n in qs.columns]
result = sales.merge(qs.reset_index())

Here are the dataframes in the intermediate steps:

qs = sales.groupby(['year', 'ID']).quantile([.90, .10]).unstack()
# qs is:
          sales          bonus
            0.9     0.1    0.9    0.1
year ID
2015 1   1360.0  1040.0  581.0  509.0
     2   2080.0  1560.0  544.0  320.0
     3   2740.0  1580.0  545.2  321.2
2016 1   1100.0  1100.0  200.0  200.0
2017 1   1860.0  1300.0  742.0  342.0
     2   2570.0  1530.0  726.0  374.0
     3   1450.0  1450.0  777.0  777.0
2018 1   1970.0  1730.0  596.0  564.0
     2   3520.0  1760.0  682.0  546.0
     3   3820.0  3180.0  582.0  358.0
2019 3   1990.0  1910.0  710.0  550.0
qs.columns = [f'{c}{int(n*100)}' for c, n in qs.columns]
# qs is now
         sales90  sales10  bonus90  bonus10
year ID
2015 1    1360.0   1040.0    581.0    509.0
     2    2080.0   1560.0    544.0    320.0
     3    2740.0   1580.0    545.2    321.2
2016 1    1100.0   1100.0    200.0    200.0
2017 1    1860.0   1300.0    742.0    342.0
...
result = sales.merge(qs.reset_index())
# result:
    year  ID  sales  bonus  sales90  sales10  bonus90  bonus10
0   2015   1   1000    500   1360.0   1040.0    581.0    509.0
1   2015   1   1400    590   1360.0   1040.0    581.0    509.0
2   2015   2   2000    400   2080.0   1560.0    544.0    320.0
3   2015   2   1450    580   2080.0   1560.0    544.0    320.0
4   2015   2   2100    300   2080.0   1560.0    544.0    320.0
5   2015   3   2900    406   2740.0   1580.0    545.2    321.2
6   2015   3   1450    580   2740.0   1580.0    545.2    321.2
7   2015   3   2100    300   2740.0   1580.0    545.2    321.2
8   2016   1   1100    200   1100.0   1100.0    200.0    200.0
9   2017   1   1200    800   1860.0   1300.0    742.0    342.0
10  2017   1   1700    300   1860.0   1300.0    742.0    342.0
11  2017   1   1900    510   1860.0   1300.0    742.0    342.0
12  2017   2   1400    770   2570.0   1530.0    726.0    374.0
13  2017   2   2700    330   2570.0   1530.0    726.0    374.0
14  2017   3   1450    777   1450.0   1450.0    777.0    777.0
15  2018   1   2000    560   1970.0   1730.0    596.0    564.0
16  2018   1   1700    600   1970.0   1730.0    596.0    564.0
17  2018   2   3900    610   3520.0   1760.0    682.0    546.0
18  2018   2   2000    530   3520.0   1760.0    682.0    546.0
19  2018   2   1700    700   3520.0   1760.0    682.0    546.0
20  2018   3   3100    330   3820.0   3180.0    582.0    358.0
21  2018   3   3900    610   3820.0   3180.0    582.0    358.0
22  2019   3   2000    530   1990.0   1910.0    710.0    550.0
23  2019   3   1900    730   1990.0   1910.0    710.0    550.0
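As a side note (not part of the answer above), groupby.transform is another vectorized route that keeps the original row order, since it broadcasts each group's quantile straight back onto that group's rows; this assumes transform forwards the extra positional argument to quantile, which recent pandas versions do:

grouped = sales.groupby(['year', 'ID'])
for name in ['sales', 'bonus']:
    # broadcast each group's percentile back onto the rows of that group
    sales[name + '90'] = grouped[name].transform('quantile', 0.90)
    sales[name + '10'] = grouped[name].transform('quantile', 0.10)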

Upvotes: 1
