thehand0
thehand0

Reputation: 1163

Get values from one column corresponding to the minimum value of another column for a subset of rows

Apologies if the question isn't totally clear. However, I do have some example code showing the desired input and output (see below).

I have a (large) pandas dataframe and want to select the minimum values in pval1 and the corresponding lag. I also want to select the minimum value in pval2 and the corresponding lag. I would like to do this for each pair of variables (i.e. (A and B), (A and C) and (B and D)). Each pair of variables occurs multiple times in the dataset.

I've tried several approaches to try and get the output I want, but seem to be missing something logic wise and I'm not quite sure what. Any help would be greatly appreciated.

Thankyou to anyone who helps!

The dataframe looks something like this :

myxdf = pd.DataFrame({
    'pval1': [0.01,0.2,0.001,0.3,0.0003,0.05,1,0.002,0.2],
    'pval2': [0.3,0.02,0.002,0.9,0.001,0.002,0.10,0.93,0.00001],
    'lag': [1,2,3,1,2,3,1,2,3],
    'var1': ['A','A','A','A','A','A','B','B','B'],
    'var2': ['B','B','B','C','C','C','D','D','D']
})
    
myxdf

My desired output for the above example should ideally look like this (please note the new lag1 and lag2 columns):

myxdf2 = pd.DataFrame({
    'pval1': [0.0010,0.0003,0.002],
    'pval2' : [0.002,0.001,0.00001],
    'lagp1': ['3','2','2'],
    'lagp2': ['3','2','3'],
    'var1': ['A','A','B'],
    'var2': ['B','C','D']
})

myxdf2

Upvotes: 1

Views: 1152

Answers (1)

jezrael
jezrael

Reputation: 862661

I believe you need DataFrameGroupBy.idxmin for indices of minimal values, use it for select rows, rename columns and join by concat:

df = myxdf.groupby(['var1','var2'])[['pval1', 'pval2']].idxmin()


df1 = myxdf.loc[df['pval1'], ['pval1','lag']].rename(columns={'lag':'lagp1'})
df2 = myxdf.loc[df['pval2'], ['pval2','lag','var1','var2']].rename(columns={'lag':'lagp2'})

df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
cols = ['pval1', 'pval2', 'lagp1', 'lagp2', 'var1', 'var2']
df = df[cols]
print (df)
    pval1    pval2  lagp1  lagp2 var1 var2
0  0.0010  0.00200      3      3    A    B
1  0.0003  0.00100      2      2    A    C
2  0.0020  0.00001      2      3    B    D

Upvotes: 1

Related Questions