Ssank
Ssank

Reputation: 3667

subset of pandas dataframe

I have a dataframe df like

Sample  Percentage  Attribute1  Attribute2
1_A       12.3         xxxx        yyyy    
1_A       5.0         aaaa        bbbb    
2_B       10           ccccc       ddddd

For each sample, I want to choose the row which has the maximum value in Percentage and make a dataframe df1. If there exist multiple percentages for a sample (such as 1_A), I would like to make a dataframe df2 in which all the rows other than the one that was chosen for df1. In the example above: df1 will have rows of Samples 1_A and 2_B and df2 will have row from Sample 2B. I tried groupby('Percentage') followed by apply(list) but was not successful for df1.

Upvotes: 0

Views: 81

Answers (2)

BENY
BENY

Reputation: 323236

IIUC drop_duplicates

df.sort_values('Percentage').drop_duplicates('Sample',keep='last')
Out[1046]: 
  Sample  Percentage Attribute1 Attribute2
2    2_B        10.0      ccccc      ddddd
0    1_A        12.3       xxxx       yyyy

Upvotes: 2

sacuL
sacuL

Reputation: 51335

I would do this by first sorting the dataframe by Percentage, so the higher percentages are at the bottom. Then, groupby the Sample column, and take the last row using tail(1) to create df1. Then, to create df2, just find the rows of df whose indices are not in df1:

df1 = df.sort_values('Percentage').groupby('Sample').tail(1)

df2 = df[~df.index.isin(df1.index)]

>>> df1
  Sample  Percentage Attribute1 Attribute2
2    2_B        10.0      ccccc      ddddd
0    1_A        12.3       xxxx       yyyy
>>> df2
  Sample  Percentage Attribute1 Attribute2
1    1_A         5.0       aaaa       bbbb

Upvotes: 0

Related Questions