Reputation: 41
I have a data set with categories/codes e.g. male/female, state of service, code of service and I have a column of paid claims.
I am looking for a way to create a table/pivot using Python to generate outputs where I only have the top 10 highest column of average paid claims by code of service (i.e. what are the top 10 codes with highest average paid claims). I also wanted to append with median, stdev, counts so the output looks something like
Table:
gender, code, state, paid claim
F, 1234, TX, $300
F, 2345, NJ, $120
F, 3456, NJ, $30
M, 1234, MN, $250
M, 4567, CA, $50
F, 1234, MA, $70
F, 8901, CA, $150
F, 23457, NY, $160
F, 4567, SD, $125
Output I am trying to generate (top 10 ave paid claim by code):
code, average claim, median claim, count claim
1234, 206, xxx, 3
So, I did something like:
service_code_average=df.groupby('service_code', as_index=False)['paid claim'].mean().sort_values(by='paid claim')
I was not able to limit to top 10 and I was struggling to append the media and count.
Upvotes: 0
Views: 480
Reputation: 21749
Here you can leverage agg
function where you can specify multiple aggregation function in one go. You can do the following:
# convert string to integer
df['paid claim'] = df['paid claim'].str.extract('(\d+)')
df['paid claim'] = df['paid claim'].astype(int)
# set n value
top_n = 2 ## set this to 10
# apply aggregation
df1 = df.groupby('code').agg({'paid claim':{'average': lambda x: x.nlargest(top_n).mean(),
'counts': lambda x: x.count(),
'median': lambda x: x.median()}})
# reset column names
df1.columns = df1.columns.droplevel()
df1 = df1.reset_index()
print(df1)
code average counts median
0 1234 275.0 3 250.0
1 2345 120.0 1 120.0
2 3456 30.0 1 30.0
3 4567 87.5 2 87.5
4 8901 150.0 1 150.0
5 23457 160.0 1 160.0
Upvotes: 1