JCV
JCV

Reputation: 515

How to create groups and subgroups in pandas dataframe

I have a pandas dataframe with laboratory tests, and for each sample, I have a few tests. For example, in this synthetic data I have 3 sample each one with 3 test pieces.

I need to get the value of a column, based on the maximum value of another column for each test, and all samples. So at the end, I would have a pandas dataframe with 9 rows.

The synthetic data is:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
name = np.array(['PI-01','PI-02','PI-03'])
name = np.repeat(name,30)
ID = np.array(['CP-A','CP-B','CP-C'])
ID = np.repeat(ID,10)
ID = np.tile(ID,3)
sig_3 = np.array([100,200,300])
sig_3 = np.repeat(sig_3,10)
sig_3 = np.tile(sig_3,3)
np.random.seed(42)
def_ax1 = np.sort(np.random.uniform(0.3,15.,10))
def_ax2 = np.sort(np.random.uniform(0.3,15.,10))
def_ax3 = np.sort(np.random.uniform(0.3,15.,10))
def_ax = np.concatenate((def_ax1,def_ax2,def_ax3),axis=0)
def_ax = np.tile(def_ax,3)
q = np.linspace(100,500,90)
p = np.linspace(100,800,90)
du = np.random.uniform(-20,500,90)
tx = pd.DataFrame({'Name':name, 'ID':ID,'sig_3':sig_3,'def_ax':def_ax,'q':q,'p':p,'du':du})

and the code I'm trying is:

max_du = tx.loc[tx.groupby('Name')['du'].idxmax()]

But gives the maximum for all group name, and not for the subgroup ID. I guess I need to group by Name, and for ID, and then find the max of du, how could I group like this? the result I would like is:

PI01 - max du for CP-01
       max du for CP-02
       max du for CP-03

PI02 - max du for CP-01
       max du for CP-02
       max du for CP-03

PI03 - max du for CP-01
       max du for CP-02
       max du for CP-03

Upvotes: 1

Views: 2467

Answers (1)

guardian
guardian

Reputation: 311

First of all, well done for postig sample data and expected output to this question!!

With groupby you don't need to use tx.loc, here your answer:

tx.groupby(['Name','ID'])['du'].max()  

groupby:

  • main group: Name
  • sub group: ID
  • ['du'] - column of interest
  • .max() - called method

after calling the columns you need a method (since x values must be compressed in the cell. ex: .unique(), .sizem(), .min(),.mean(),.max(), etc...

Alternative:

I like to use .agg, since it allows me to use functions (numpy, lambda, etc.) instead of methods and I have a better overview with it.

max_du=tx.groupby(['Name','ID']).agg({'du':np.max}) 

Here an example with more features:

tx.groupby(['Name','ID']).agg({'du':(np.max,np.size,np.mean), 'q': lambda x: list(pd.unique(x))})

.agg allows us to get more statistics from column 'du' and a list of all 'q' values

tx.loc[tx.groupby(['Name','ID'])['du'].idxmax()] returns a part of the dataframe. Exactly where du hat its max (for the above groups).

Upvotes: 3

Related Questions