user3447653

Reputation: 4158

Group by and join columns in Pandas Dataframe

For every column in the list cat_column, I need to loop over the list numerical_cols and get the mean and standard deviation. I have the code below that does it, but at the end of the second loop I need a final table with each cat_column and the mean and standard deviation of all the numerical columns, like below.

Code1 Code2 Mean_Code1_CarAge  Std_Code1_CarAge  Mean_Code1_CarPrice   Std_Code1_CarPrice Mean_Code2_CarAge  Std_Code2_CarAge  Mean_Code2_CarPrice   Std_Code2_CarPrice

Code:

import numpy as np

cat_column = ["Code1", "Code2"]
numerical_cols = ['CarAge', 'CarPrice']

for base_col in cat_column:
    for col in numerical_cols:
        df = df.groupby(base_col)[col].agg([np.mean, np.std]).reset_index().rename(
            columns={'mean': 'mean_'+base_col+"_"+col, 'std': 'std_'+base_col+"_"+col})

Input:

     Code1 Code2 CarAge CarPrice
      AAA   AA1      12    5000 
      BBB   BB1      30   10000 
      CCC   CC1      64   22000 
      AAA   AA1      19    4000 
      BBB   BB1      49   10000 

Output:

   Code1 Code2  Mean_Code1_CarAge  Std_Code1_CarAge  Mean_Code1_CarPrice  Std_Code1_CarPrice  Mean_Code2_CarAge  Std_Code2_CarAge  Mean_Code2_CarPrice  Std_Code2_CarPrice
   AAA   AA1                 15.5              4.95                 4500              707.10               15.5              4.95                 4500              707.10
   BBB   BB1                 39.5             13.43                10000                0.00               39.5             13.43                10000                0.00
   CCC   CC1                 64.0               NaN                22000                 NaN               64.0               NaN                22000                 NaN

I'm not sure how to do that dynamically with the above code. Any leads/suggestions would be appreciated.

Upvotes: 2

Views: 58

Answers (1)

Henry Ecker

Reputation: 35646

Try groupby aggregate with a dictionary built from the values in numerical_cols, then flatten the resulting MultiIndex columns with map, and finally concat the per-key results on axis=1:

import pandas as pd

df = pd.DataFrame({
    'Code1': {0: 'AAA', 1: 'BBB', 2: 'CCC', 3: 'AAA', 4: 'BBB'},
    'Code2': {0: 'AA1', 1: 'BB1', 2: 'CC1', 3: 'AA1', 4: 'BB1'},
    'CarAge': {0: 12, 1: 30, 2: 64, 3: 19, 4: 49},
    'CarPrice': {0: 5000, 1: 10000, 2: 22000, 3: 4000, 4: 10000}}
)

cat_columns = ["Code1", "Code2"]
numerical_cols = ['CarAge', 'CarPrice']

# Create a dictionary to map keys to aggregation types
agg_d = {k: ['mean', 'std'] for k in numerical_cols}

dfs = []

for cat_column in cat_columns:
    # Groupby Agg to get aggs for each key in agg_d per group
    g = df.groupby(cat_column).aggregate(agg_d)
    # Reduce Multi Index
    g.columns = g.columns.map(lambda x: f'{x[1]}_{cat_column}_{x[0]}')
    # Reset Index
    g = g.reset_index()
    dfs.append(g)

# Concat on Axis 1
new_df = pd.concat(dfs, axis=1)

# Re Order Columns
new_df = new_df[[*cat_columns, *new_df.columns.difference(cat_columns)]]

print(new_df.to_string())

new_df:

  Code1 Code2  mean_Code1_CarAge  mean_Code1_CarPrice  mean_Code2_CarAge  mean_Code2_CarPrice  std_Code1_CarAge  std_Code1_CarPrice  std_Code2_CarAge  std_Code2_CarPrice
0   AAA   AA1               15.5                 4500               15.5                 4500          4.949747          707.106781          4.949747          707.106781
1   BBB   BB1               39.5                10000               39.5                10000         13.435029            0.000000         13.435029            0.000000
2   CCC   CC1               64.0                22000               64.0                22000               NaN                 NaN               NaN                 NaN
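
Note that pd.concat(dfs, axis=1) lines the two summaries up purely by row position, which works here because each Code1 value pairs with exactly one Code2 value. If the two groupings could get out of step, merging each summary back onto the distinct code pairs is safer. A minimal sketch of that variant (my own, not part of the answer above; it reuses df, cat_columns and numerical_cols from the snippet):

# Sketch: merge each per-key summary onto the distinct (Code1, Code2) pairs,
# so rows stay aligned even if the codes are not in one-to-one correspondence.
pairs = df[cat_columns].drop_duplicates().reset_index(drop=True)

out = pairs
for cat_column in cat_columns:
    g = df.groupby(cat_column)[numerical_cols].agg(['mean', 'std'])
    # Flatten the (column, stat) MultiIndex into stat_key_column names
    g.columns = [f'{stat}_{cat_column}_{col}' for col, stat in g.columns]
    out = out.merge(g.reset_index(), on=cat_column, how='left')

print(out.to_string())

This gives the same frame as new_df for this data, at the cost of one merge per key instead of a single concat.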

Upvotes: 2
