Reputation: 2014
I am attempting to group a pandas DataFrame and calculate quantiles and aggregates from a column. Here's a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': [1, 1, 1, 2],
    'cat': ['p', 'p', 'p', 'n'],
    'num': [5, 10, 10, 5],
    'v': [np.nan, np.nan, np.nan, 'v2'],
    'p': [1000, 1300, 1400, 1100]
})
I am looking for a solution that scales with the number of categorical and numeric columns. The numeric and categorical columns should be aggregated using the mode. From p, create two new columns: the range between the .25 and .75 quantiles, and the range between the min and max.
Expected output:
id  cat  num  v    pquantile    min-max
1   p    10   NaN  1075 - 1325  1000 - 1400
2   n    5    v2   1100         1100
Also, the mode aggregation needs to be able to handle ties.
Upvotes: 0
Views: 329
Reputation: 4633
As outlined in your question, first group by "cat" and use the agg method to select the most common value:
df_grouped_by = df.groupby('cat').agg(pd.Series.mode)
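This also covers the tie requirement from the question: when several values are equally frequent, pd.Series.mode returns all of them rather than failing, and agg stores them as a list-like. A quick check:

```python
import pandas as pd

# a clear winner: mode returns a single value
assert pd.Series([5, 10, 10]).mode().tolist() == [10]

# a tie: mode returns every tied value, in sorted order,
# so downstream code can still pick one, e.g. with .mode()[0]
assert pd.Series([5, 5, 10, 10]).mode().tolist() == [5, 10]
```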
Then compute the 0.25 and 0.75 quantiles for each list of values in the p column:
df_grouped_by['pquantile'] = df_grouped_by.apply(lambda row: np.quantile(row['p'], [0.25, 0.75]), axis=1)
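For reference, np.quantile interpolates linearly by default, which is why group "p" ends up with [1150.0, 1350.0] rather than the [1075, 1325] of the question's expected output. A quick sanity check:

```python
import numpy as np

# group "p" holds p values [1000, 1300, 1400]; with linear
# interpolation the 0.25 quantile falls halfway between 1000
# and 1300, and the 0.75 quantile halfway between 1300 and 1400
q = np.quantile([1000, 1300, 1400], [0.25, 0.75])
assert q.tolist() == [1150.0, 1350.0]
```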
Finally, compute the min and max values using a similar logic:
df_grouped_by['min-max'] = df_grouped_by.apply(lambda row: [np.min(row['p']), np.max(row['p'])], axis=1)
This yields:
cat  id  num  v   p                   pquantile         min-max
n    2   5    v2  1100                [1100.0, 1100.0]  [1100, 1100]
p    1   10   []  [1000, 1300, 1400]  [1150.0, 1350.0]  [1000, 1400]
You may then reindex as you see fit and drop the p column.
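If you also want the two new columns rendered as the "low - high" strings shown in the question's expected output, one possible end-to-end sketch (the fmt helper is hypothetical, not part of pandas) is:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 1, 1, 2],
    'cat': ['p', 'p', 'p', 'n'],
    'num': [5, 10, 10, 5],
    'v': [np.nan, np.nan, np.nan, 'v2'],
    'p': [1000, 1300, 1400, 1100]
})

def fmt(lo, hi):
    # hypothetical helper: collapse an equal pair to a single value,
    # otherwise render "low - high"
    return f'{lo:g}' if lo == hi else f'{lo:g} - {hi:g}'

# mode-aggregate everything except p, as above
out = df.drop(columns='p').groupby('cat').agg(pd.Series.mode)

# build the two string columns from the raw p values per group
g = df.groupby('cat')['p']
out['pquantile'] = g.agg(lambda s: fmt(*np.quantile(s, [0.25, 0.75])))
out['min-max'] = g.agg(lambda s: fmt(s.min(), s.max()))
```

With this data, group "p" gets '1150 - 1350' and '1000 - 1400', while the single-row group "n" collapses both ranges to '1100'.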
Upvotes: 1