user2928791
user2928791

Reputation: 71

Percentiles combined with Pandas groupby/aggregate

I am trying to create a function that computes different percentiles of multiple variables in a data frame. I am using a dict in combination with a Pandas aggregate function as below:

dfG = df.groupby('ClinicalEpisode')
dfA = dfG.agg( { 'Total LOS' : 
                 {'Total LOS P5' : 'pd.quantile(.05)',
                  'Total LOS P10' : 'pd.quantile(.10)',
                  'Total LOS P15' : 'pd.quantile(.15)',
                  'Total LOS P20' : 'pd.quantile(.20)',
                  'Total LOS P25' : 'pd.quantile(.25)',
                  'Total LOS P30' : 'pd.quantile(.30)',
                  'Total LOS P33' : 'pd.quantile(.333333)',
                  'Total LOS P35' : 'pd.quantile(.35)',
                  'Total LOS P40' : 'pd.quantile(.40)',
                  'Total LOS P50' : 'pd.quantile(.50)',
                  'Total LOS P75' : 'pd.quantile(.75)',
                  'Total LOS P80' : 'pd.quantile(.80)',
                  'Total LOS P90' : 'pd.quantile(.90)'},
            'Trigger SNF LOS' :
                 {'Trigger SNF LOS P5' : 'pd.quantile(.05)',
                  'Trigger SNF LOS P10' : 'pd.quantile(.10)',
                  'Trigger SNF LOS P15' : 'pd.quantile(.15)',
                  'Trigger SNF LOS P20' : 'pd.quantile(.20)',
                  'Trigger SNF LOS P25' : 'pd.quantile(.25)',
                  'Trigger SNF LOS P30' : 'pd.quantile(.30)',
                  'Trigger SNF LOS P33' : 'pd.quantile(.333333)',
                  'Trigger SNF LOS P35' : 'pd.quantile(.35)',
                  'Trigger SNF LOS P40' : 'pd.quantile(.40)',
                  'Trigger SNF LOS P50' : 'pd.quantile(.50)',
                  'Trigger SNF LOS P75' : 'pd.quantile(.75)',
                  'Trigger SNF LOS P80' : pd.quantile(.80),
                  'Trigger SNF LOS P90' : pd.quantile(.90)}
            })

I have tried a number of different functions, but nothing seems to work with a dict.

FWIW, I am able to compute these quantiles one variable at a time with code like this:

dfA = df.groupby('ClinicalEpisode')['Total LOS'].quantile(
    [.05, .1, .15, .2, .25, .3, .3333, .35, .4, .5, .6, .7, .75, .8, .9, .95])

Still, I'd really like to be able to use the dict approach. I'm just stuck.

Upvotes: 2

Views: 6521

Answers (1)

TomAugspurger
TomAugspurger

Reputation: 28936

FYI, it's helpful to provide sample data and your expected output. You should also be more explicit than "I'm just stuck".

You've got two problems

  1. There isn't a pandas quantile method. There's a DataFrame.quantile method, but we can't use that. This is related to your second problem.
  2. The aggregation method on your GroupBy object expects functions that take an array and return a single value. We'll use numpy's percentile which takes an array and a percentile,q, between 0 and 100. Like I said, groupby is expecting a function that just takes an array, so let's fix that using functools.partial

Here's how to do that:

In [62]: percentiles = [5, 10, 15, 20, 25, 30, 33, 35, 40, 50, 75, 80, 90]

In [64]: from functools import partial

In [65]: aggs = {'P {}'.format(q): partial(np.percentile, q=q) for q in percentiles}
In [66]: aggs
Out[66]: 
{'P 40': functools.partial(<function percentile at 0x10abde378>, q=40),
 'P 90': functools.partial(<function percentile at 0x10abde378>, q=90),
...}

Now we can pass in aggs.

In [71]: df = pd.DataFrame(np.random.randn(20, 4))

In [72]: df['g'] = np.random.randint(0, 2, 20)

In [73]: df.groupby('g').agg({0: aggs, 1: aggs, 2:aggs})
Out[73]: 
          0                                                              \
       P 40      P 90      P 80      P 20      P 30      P 35      P 75   
g                                                                         
0 -1.451969 -0.134986 -0.466439 -1.726501 -1.475623 -1.463796 -0.632166   
1  0.249210  1.363307  1.029008 -0.644655 -0.241753  0.180993  0.952654   

                                                                      1  \
        P 5      P 15      P 25    P 33      P 50      P 10      P 40   
g                                                                         
0 -2.443653 -1.965552 -1.487451 -2.666927 -1.428315 -2.204603 -1.359988   
1 -1.423351 -0.728314 -0.491645 -1.507900  0.381779 -1.126839  0.261025   

....

You can modify the keys in the dictionary if you want to have Total LOS .... I just had P [percentile] since the column they came from are in the upper level of the MultiIndex.

Upvotes: 7

Related Questions