Adrian Keister

Reputation: 1025

Python Pandas: groupby one column, aggregate in only one other column, but take corresponding data

I have seen a number of other related SO questions like this and this, but they do not seem to be exactly what I want. Suppose I have a dataframe like this:

import pandas as pd
df = pd.DataFrame(columns=['patient', 'parent csn', 'child csn', 'days'])
df.loc[0] = [0, 0, 10, 5]
df.loc[1] = [0, 0, 11, 3]
df.loc[2] = [0, 1, 12, 6]
df.loc[3] = [0, 1, 13, 4]
df.loc[4] = [1, 2, 20, 4]
df
Out[9]: 
  patient parent csn child csn days
0       0          0        10    5
1       0          0        11    3
2       0          1        12    6
3       0          1        13    4
4       1          2        20    4

Now what I want to do is something like this:

grp_df = df.groupby(['parent csn']).min()

The problem is that the result computes the min across all columns (that aren't parent csn), and that produces:

grp_df
            patient  child csn  days
parent csn                          
0                 0         10     3
1                 0         12     4
2                 1         20     4

You can see that for the first row, the days number and the child csn number are no longer on the same row, like they were before grouping. Here's the output I want:

grp_df
            patient  child csn  days
parent csn                          
0                 0         11     3
1                 0         13     4
2                 1         20     4

How can I get that? I have code that iterates through the dataframe, and I think it will work, but it is slow as all get-out, even with Cython. I feel like this should be obvious, but I am not finding it so.

I looked at this question as well, but putting the child csn in the groupby list will not work, because child csn varies along with days.

This question seems more likely, but I'm not finding the solutions very intuitive.

This question also seems likely, but again, the answers aren't very intuitive, plus I do want only one row for each parent csn.

One other detail: the row containing the minimum days value might not be unique. In that case, I just want one row - I don't care which.

Many thanks for your time!

Upvotes: 4

Views: 3505

Answers (4)

Cameron Riddell

Reputation: 13407

You can do this by using .idxmin() instead of .min() to get the index (row identifier) where "days" is at its minimum for each group:

data creation:

import pandas as pd

data = [[0, 0, 10, 5],
        [0, 0, 11, 3],
        [0, 1, 12, 6],
        [0, 1, 13, 4],
        [1, 2, 20, 4]]
df = pd.DataFrame(data, columns=['patient', 'parent csn', 'child csn', 'days'])

print(df)
   patient  parent csn  child csn  days
0        0           0         10     5
1        0           0         11     3
2        0           1         12     6
3        0           1         13     4
4        1           2         20     4

day_minimum_row_indices = df.groupby("parent csn")["days"].idxmin()

print(day_minimum_row_indices)
parent csn
0    1
1    3
2    4
Name: days, dtype: int64

From this you can see that the group parent csn 0 had its minimum number of days at row 1. Looking back at our original dataframe, we can see that row 1 had days == 3 and is in fact the location of the minimum days for parent csn == 0. Parent csn == 1 had its minimum days at row 3, and so on.

We can use the row indices to subset back into our original dataframe:

new_df = df.loc[day_minimum_row_indices]

print(new_df)
   patient  parent csn  child csn  days
1        0           0         11     3
3        0           1         13     4
4        1           2         20     4

Edit (tldr):

df.loc[df.groupby("parent csn")["days"].idxmin()]
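
On the non-unique minimum mentioned in the question: as far as I know, idxmin returns the label of the first occurrence of the minimum, so a group with tied rows still yields exactly one row. A minimal sketch with made-up tie data (these values are illustrative, not from the question); the set_index call is only there to match the layout of the desired output:

```python
import pandas as pd

# Illustrative data (not from the question): parent csn 0 has two rows tied at days == 3
df = pd.DataFrame(
    [[0, 0, 10, 3],
     [0, 0, 11, 3],
     [0, 1, 12, 6]],
    columns=["patient", "parent csn", "child csn", "days"],
)

# One row label per group; for ties, the first occurrence is returned
row_labels = df.groupby("parent csn")["days"].idxmin()

# Select whole rows by label, then move the group key into the index
result = df.loc[row_labels].set_index("parent csn")
print(result)
```

If you need a specific tied row rather than the first one, sort the frame into the order you want before calling idxmin.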

Upvotes: 7

Michael Szczesny

Reputation: 5036

For some reason I can't explain, your dataframe has columns of type object. This solution only works with numeric columns, so convert days first:

df.days = df.days.astype(int)
# idxmin returns index labels, so select with .loc rather than .iloc
df.loc[df.groupby('parent csn').days.idxmin()]

Out:

  patient parent csn child csn  days
1       0          0        11     3
3       0          1        13     4
4       1          2        20     4
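
A likely explanation for the object columns (my assumption, not something confirmed in the question): the frame is created empty and then filled row by row with .loc, and an empty DataFrame starts out with object columns. The sketch below rebuilds the question's frame the same way, prints the dtypes so you can check them on your pandas version, and converts the whole frame in one step instead of only days:

```python
import pandas as pd

# Same construction as in the question: empty frame, then rows added with .loc
df = pd.DataFrame(columns=["patient", "parent csn", "child csn", "days"])
df.loc[0] = [0, 0, 10, 5]
df.loc[1] = [0, 0, 11, 3]
df.loc[2] = [0, 1, 12, 6]
df.loc[3] = [0, 1, 13, 4]
df.loc[4] = [1, 2, 20, 4]

# Inspect the column dtypes produced by this construction
print(df.dtypes)

# Convert every column to a numeric dtype in one go
fixed = df.astype("int64")

# Label-based selection of the minimum-days row per parent csn
result = fixed.loc[fixed.groupby("parent csn")["days"].idxmin()]
print(result)
```

Building the frame from a list of rows, as in the answer above, avoids the conversion altogether.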

Upvotes: 1

Andy L.

Reputation: 25239

To get your desired output, sort with sort_values first, then take the first row of each group with groupby:

df_final = (df.sort_values(['parent csn', 'patient', 'days'])
              .groupby('parent csn').first())

Out[813]:
            patient  child csn  days
parent csn
0                 0         11     3
1                 0         13     4
2                 1         20     4
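
One caveat worth knowing: groupby(...).first() takes the first non-null value of each column separately, so if the data contains NaN it can combine values from different rows. A variant of the same sort-then-pick idea that always keeps whole rows is drop_duplicates. This is a sketch on the question's data, not part of the original answer:

```python
import pandas as pd

data = [[0, 0, 10, 5],
        [0, 0, 11, 3],
        [0, 1, 12, 6],
        [0, 1, 13, 4],
        [1, 2, 20, 4]]
df = pd.DataFrame(data, columns=["patient", "parent csn", "child csn", "days"])

one_per_parent = (
    df.sort_values("days", kind="stable")   # smallest days first, ties keep original order
      .drop_duplicates("parent csn")        # keep the first (minimum-days) row per parent csn
      .sort_values("parent csn")            # restore ordering by group key
      .set_index("parent csn")              # match the layout of the desired output
)
print(one_per_parent)
```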

Upvotes: 4

David Erickson

Reputation: 16673

You can filter the dataframe for the rows you need by using groupby to build a boolean mask, rather than aggregating with .groupby directly:

s = df.groupby('parent csn')['days'].transform('min') == df['days']
df = df[s]
df

Out[1]: 
   patient  parent csn  child csn  days
1        0           0         11     3
3        0           1         13     4
4        1           2         20     4

For example, this is how it would look if I put s in my dataframe as a column. Then you just filter for the True rows, which are the rows whose days value equals the minimum for their group.

Out[2]: 
   patient  parent csn  child csn  days      s
0        0           0         10     5  False
1        0           0         11     3   True
2        0           1         12     6  False
3        0           1         13     4   True
4        1           2         20     4   True
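
Note that this mask keeps every row that ties for the minimum, while the question asks for a single row per parent csn. A minimal sketch of one way to handle that, using made-up tie data (not from the question) and drop_duplicates to keep the first matching row per group:

```python
import pandas as pd

# Illustrative data (not from the question): parent csn 0 has two rows tied at days == 3
df = pd.DataFrame(
    [[0, 0, 10, 3],
     [0, 0, 11, 3],
     [0, 1, 12, 6]],
    columns=["patient", "parent csn", "child csn", "days"],
)

# True wherever a row's days equals its group's minimum
mask = df.groupby("parent csn")["days"].transform("min") == df["days"]

tied = df[mask]                                # both tied rows survive the filter
one_each = tied.drop_duplicates("parent csn")  # keep only the first row per group
print(one_each)
```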

Upvotes: 1
