Reputation: 873
I have sales data of the form:
ID,year,sales,bonus
1,2015,1000,500
1,2015,1400,590
1,2016,1100,200
1,2017,1200,800
1,2017,1700,300
1,2017,1900,510
1,2018,2000,560
1,2018,1700,600
2,2015,2000,400
2,2015,1450,580
2,2015,2100,300
2,2017,1400,770
2,2017,2700,330
2,2018,3900,610
2,2018,2000,530
2,2018,1700,700
3,2015,2900,406
3,2015,1450,580
3,2015,2100,300
3,2017,1450,777
3,2018,3100,330
3,2018,3900,610
3,2019,2000,530
3,2019,1900,730
I want to add new columns capturing the 10th and 90th percentiles of the sales and bonus columns for each ID and year (I appreciate this doesn't make much sense in this simplified dataset).
I have used the following approach:
import pandas as pd

def aggs(df, names):
    # For each named column, add 90th and 10th percentile columns within the group
    for i in range(0, len(names)):
        df[names[i] + '90'] = df[names[i]].quantile(.90)
        df[names[i] + '10'] = df[names[i]].quantile(.10)
    return df

sales = pd.read_csv("sales.csv")
names = ['sales', 'bonus']
sales2 = sales.groupby(['year', 'ID'], as_index=False)
sales2 = sales2.apply(aggs, names)
To produce:
ID year sales bonus sales90 sales10 bonus90 bonus10
0 1 2015 1000 500 1360.0 1040.0 581.0 509.0
1 1 2015 1400 590 1360.0 1040.0 581.0 509.0
2 1 2016 1100 200 1100.0 1100.0 200.0 200.0
3 1 2017 1200 800 1860.0 1300.0 742.0 342.0
4 1 2017 1700 300 1860.0 1300.0 742.0 342.0
5 1 2017 1900 510 1860.0 1300.0 742.0 342.0
6 1 2018 2000 560 1970.0 1730.0 596.0 564.0
7 1 2018 1700 600 1970.0 1730.0 596.0 564.0
8 2 2015 2000 400 2080.0 1560.0 544.0 320.0
9 2 2015 1450 580 2080.0 1560.0 544.0 320.0
10 2 2015 2100 300 2080.0 1560.0 544.0 320.0
11 2 2017 1400 770 2570.0 1530.0 726.0 374.0
12 2 2017 2700 330 2570.0 1530.0 726.0 374.0
13 2 2018 3900 610 3520.0 1760.0 682.0 546.0
14 2 2018 2000 530 3520.0 1760.0 682.0 546.0
15 2 2018 1700 700 3520.0 1760.0 682.0 546.0
16 3 2015 2900 406 2740.0 1580.0 545.2 321.2
17 3 2015 1450 580 2740.0 1580.0 545.2 321.2
18 3 2015 2100 300 2740.0 1580.0 545.2 321.2
19 3 2017 1450 777 1450.0 1450.0 777.0 777.0
20 3 2018 3100 330 3820.0 3180.0 582.0 358.0
21 3 2018 3900 610 3820.0 3180.0 582.0 358.0
22 3 2019 2000 530 1990.0 1910.0 710.0 550.0
Problem
The code works as desired, but as the number of rows and columns grows, there are big performance issues. I also get the warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
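I think the "join all columns at once" pattern the warning is pointing at would look roughly like the sketch below (aggs_concat is just an illustrative name, and I'm not sure it addresses the groupby cost itself):

import pandas as pd

def aggs_concat(df, names):
    # Build every new column first, then attach them in a single concat
    # instead of inserting them one at a time with df[...] = ...
    new_cols = {
        f'{name}{int(q * 100)}': df[name].quantile(q)
        for name in names
        for q in (0.90, 0.10)
    }
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)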
Can anyone advise how I could improve this approach? Thanks!
Upvotes: 1
Views: 152
Reputation: 15962
The df.quantile() method can be passed a collection of quantiles to compute; on a groupby, those quantiles end up as an extra level of the index by default. unstack() moves that level into the columns; then flatten the resulting multi-level columns into single-level names. merge() joins easily when the frames have columns in common, so use it with qs.reset_index() to get year and ID back as columns.
qs = sales.groupby(['year', 'ID']).quantile([.90, .10]).unstack()
qs.columns = [f'{c}{int(n*100)}' for c, n in qs.columns]
result = sales.merge(qs.reset_index())
Here are the dataframes in the intermediate steps:
qs = sales.groupby(['year', 'ID']).quantile([.90, .10]).unstack()
# qs is:
sales bonus
0.9 0.1 0.9 0.1
year ID
2015 1 1360.0 1040.0 581.0 509.0
2 2080.0 1560.0 544.0 320.0
3 2740.0 1580.0 545.2 321.2
2016 1 1100.0 1100.0 200.0 200.0
2017 1 1860.0 1300.0 742.0 342.0
2 2570.0 1530.0 726.0 374.0
3 1450.0 1450.0 777.0 777.0
2018 1 1970.0 1730.0 596.0 564.0
2 3520.0 1760.0 682.0 546.0
3 3820.0 3180.0 582.0 358.0
2019 3 1990.0 1910.0 710.0 550.0
qs.columns = [f'{c}{int(n*100)}' for c, n in qs.columns]
# qs is now
sales90 sales10 bonus90 bonus10
year ID
2015 1 1360.0 1040.0 581.0 509.0
2 2080.0 1560.0 544.0 320.0
3 2740.0 1580.0 545.2 321.2
2016 1 1100.0 1100.0 200.0 200.0
2017 1 1860.0 1300.0 742.0 342.0
...
result = sales.merge(qs.reset_index())
# result:
year ID sales bonus sales90 sales10 bonus90 bonus10
0 2015 1 1000 500 1360.0 1040.0 581.0 509.0
1 2015 1 1400 590 1360.0 1040.0 581.0 509.0
2 2015 2 2000 400 2080.0 1560.0 544.0 320.0
3 2015 2 1450 580 2080.0 1560.0 544.0 320.0
4 2015 2 2100 300 2080.0 1560.0 544.0 320.0
5 2015 3 2900 406 2740.0 1580.0 545.2 321.2
6 2015 3 1450 580 2740.0 1580.0 545.2 321.2
7 2015 3 2100 300 2740.0 1580.0 545.2 321.2
8 2016 1 1100 200 1100.0 1100.0 200.0 200.0
9 2017 1 1200 800 1860.0 1300.0 742.0 342.0
10 2017 1 1700 300 1860.0 1300.0 742.0 342.0
11 2017 1 1900 510 1860.0 1300.0 742.0 342.0
12 2017 2 1400 770 2570.0 1530.0 726.0 374.0
13 2017 2 2700 330 2570.0 1530.0 726.0 374.0
14 2017 3 1450 777 1450.0 1450.0 777.0 777.0
15 2018 1 2000 560 1970.0 1730.0 596.0 564.0
16 2018 1 1700 600 1970.0 1730.0 596.0 564.0
17 2018 2 3900 610 3520.0 1760.0 682.0 546.0
18 2018 2 2000 530 3520.0 1760.0 682.0 546.0
19 2018 2 1700 700 3520.0 1760.0 682.0 546.0
20 2018 3 3100 330 3820.0 3180.0 582.0 358.0
21 2018 3 3900 610 3820.0 3180.0 582.0 358.0
22 2019 3 2000 530 1990.0 1910.0 710.0 550.0
23 2019 3 1900 730 1990.0 1910.0 710.0 550.0
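If you want to avoid the merge and keep the rows in their original order, a transform-based variant along these lines should also work (a sketch, not benchmarked against the merge approach):

import pandas as pd

sales = pd.read_csv("sales.csv")
grouped = sales.groupby(['year', 'ID'])

# transform() broadcasts each group's quantile back onto that group's rows,
# so the new columns line up with the original frame without a merge.
new_cols = {
    f'{col}{int(q * 100)}': grouped[col].transform('quantile', q)
    for col in ['sales', 'bonus']
    for q in (0.90, 0.10)
}
result = sales.assign(**new_cols)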
Upvotes: 1