kms

Reputation: 2014

Pandas Groupby with Aggregate, and Quantiles

I am trying to group a pandas DataFrame and compute quantiles and other aggregates from its columns.

Here's a sample DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
                   'id': [1, 1, 1, 2],
                   'cat': ['p','p','p','n'],
                   'num': [5, 10, 10, 5],
                   'v': [np.nan, np.nan, np.nan, 'v2'],
                   'p': [1000, 1300, 1400, 1100]
                 })

 

I am looking for a solution that scales to any number of categorical and numeric columns. Both numeric and categorical columns should be aggregated with the mode.

For p, create two new columns: the range between the 0.25 and 0.75 quantiles, and the range between the min and max.

Expected output:

id  cat  num  v    pquantile    min-max
1   p    10   NaN  1075 - 1325  1000 - 1400
2   n    5    v2   1100         1100

The mode aggregation also needs to handle ties.

Upvotes: 0

Views: 329

Answers (1)

Sheldon

Reputation: 4633

As outlined in your question, first group by "cat" and use the agg method with pd.Series.mode to select the most common value of each remaining column.

df_grouped_by = df.groupby('cat').agg(pd.Series.mode)
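It may help to see how pd.Series.mode behaves on a tie, since that explains why the p column ends up holding a list. The sketch below is only an illustration on hand-written values (the names single and tied are made up here): when one value is most frequent, mode returns just that value; when every value occurs equally often, it returns all of them in ascending order.

```python
import pandas as pd

# One value is clearly the most frequent: mode() returns a one-element Series.
single = pd.Series([5, 10, 10]).mode().tolist()

# Every value occurs once, so all of them tie: mode() returns all values, sorted.
tied = pd.Series([1400, 1000, 1300]).mode().tolist()

print(single)  # [10]
print(tied)    # [1000, 1300, 1400]
```

In the sample data every p value within a group is distinct, so the "mode" of p is the full list of that group's values. That would no longer hold if a p value repeated inside a group.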

Then compute the 0.25 and 0.75 quantiles for each list value in the p column:

df_grouped_by['pquantile'] = df_grouped_by.apply(lambda row : np.quantile(row['p'],[0.25,0.75]), axis = 1)
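For reference, here is a small worked example of what np.quantile returns for the three p values of group "p" with its default linear interpolation. The variable names are only for illustration. Note that these defaults give 1150 and 1350, not the 1075 and 1325 shown in the question's expected output, so a different quantile definition would be needed to reproduce those numbers.

```python
import numpy as np

values = [1000, 1300, 1400]
q25, q75 = np.quantile(values, [0.25, 0.75])

# Default linear interpolation places quantile q at position q * (n - 1)
# in the sorted values (n = 3 here):
#   q = 0.25 -> position 0.5 -> 1000 + 0.5 * (1300 - 1000) = 1150
#   q = 0.75 -> position 1.5 -> 1300 + 0.5 * (1400 - 1300) = 1350
print(q25, q75)
```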

Finally, compute the min and max values using a similar logic:

df_grouped_by['min-max'] = df_grouped_by.apply(lambda row : [np.min(row['p']), np.max(row['p'])] , axis = 1)

This yields:

cat id  num v   p                   pquantile           min-max                         
n   2   5   v2  1100                [1100.0, 1100.0]    [1100, 1100]
p   1   10  []  [1000, 1300, 1400]  [1150.0, 1350.0]    [1000, 1400]

You may then reindex as you see fit and drop the p column.
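One caveat: the steps above compute the quantiles and min/max from the mode of p, which only equals the full list of values when no p value repeats within a group. A possible alternative, sketched below under that concern, is to take the statistics for p directly from the grouped column and use the mode only for the other columns. The helper first_mode and the column names p_q25, p_q75, p_min and p_max are hypothetical, and breaking ties by taking the smallest value is an assumption, not something the question specifies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2],
    'cat': ['p', 'p', 'p', 'n'],
    'num': [5, 10, 10, 5],
    'v': [np.nan, np.nan, np.nan, 'v2'],
    'p': [1000, 1300, 1400, 1100],
})

def first_mode(s):
    """Most frequent non-null value; smallest on a tie (assumed rule); NaN if all null."""
    modes = s.mode()  # NaN is dropped by default, result is sorted
    return modes.iloc[0] if not modes.empty else np.nan

grouped = df.groupby('cat')

# Mode for every column except the grouping key and p.
mode_cols = [c for c in df.columns if c not in ('cat', 'p')]
modes = grouped[mode_cols].agg(first_mode)

# Quantiles of p per group, one column per quantile.
quantiles = grouped['p'].quantile([0.25, 0.75]).unstack()
quantiles.columns = ['p_q25', 'p_q75']

# Min and max of p per group.
extremes = grouped['p'].agg(['min', 'max']).add_prefix('p_')

result = pd.concat([modes, quantiles, extremes], axis=1)
print(result)
```

Because the mode columns are built from a list of column names, the same code should extend to additional categorical or numeric columns without changes. The quantile and min/max pairs are kept as separate numeric columns here; they can be formatted into "low - high" strings afterwards if that presentation is required.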

Upvotes: 1
