Reputation: 1025
I have seen a number of other related SO questions like this and this, but they do not seem to be exactly what I want. Suppose I have a dataframe like this:
import pandas as pd
df = pd.DataFrame(columns=['patient', 'parent csn', 'child csn', 'days'])
df.loc[0] = [0, 0, 10, 5]
df.loc[1] = [0, 0, 11, 3]
df.loc[2] = [0, 1, 12, 6]
df.loc[3] = [0, 1, 13, 4]
df.loc[4] = [1, 2, 20, 4]
df
Out[9]:
patient parent csn child csn days
0 0 0 10 5
1 0 0 11 3
2 0 1 12 6
3 0 1 13 4
4 1 2 20 4
Now what I want to do is something like this:
grp_df = df.groupby(['parent csn']).min()
The problem is that the result computes the min across all columns (that aren't parent csn
), and that produces:
grp_df
patient child csn days
parent csn
0 0 10 3
1 0 12 4
2 1 20 4
You can see that for the first row, the days
number and the child csn
number are no longer on the same row, like they were before grouping. Here's the output I want:
grp_df
patient child csn days
parent csn
0 0 11 3
1 0 13 4
2 1 20 4
How can I get that? I have code that iterates through the dataframe, and I think it will work, but it is slow as all get-out, even with Cython. I feel like this should be obvious, but I am not finding it so.
I looked at this question as well, but putting the child csn
in the groupby list will not work, because child csn
varies as days
.
This question seems more likely, but I'm not finding the solutions very intuitive.
This question also seems likely, but again, the answers aren't very intuitive, plus I do want only one row for each parent csn
.
One other detail: the row containing the minimum days
value might not be unique. In that case, I just want one row - I don't care which.
Many thanks for your time!
Upvotes: 4
Views: 3505
Reputation: 13407
You can do this by using .idxmin()
instead of .min()
to get the index (row identifier) where "days" is at it minimum for each group:
data creation:
import pandas as pd
data = [[0, 0, 10, 5],
[0, 0, 11, 3],
[0, 1, 12, 6],
[0, 1, 13, 4],
[1, 2, 20, 4]]
df = pd.DataFrame(data, columns=['patient', 'parent csn', 'child csn', 'days'])
print(df)
patient parent csn child csn days
0 0 0 10 5
1 0 0 11 3
2 0 1 12 6
3 0 1 13 4
4 1 2 20 4
day_minimum_row_indices = df.groupby("parent csn")["days"].idxmin()
print(day_minimum_row_indices)
parent csn
0 1
1 3
2 4
Name: days, dtype: int64
From this you can see that the group parent csn 0 had a minimum number of days at row 1. Looking back to our original dataframe, we can see that row 1 had days == 3 and is infact the location of the minimum days for parent csn == 0. Parent csn == 1 had a minimum days at row 3, so on and so forth.
We can use the row indices to subset back into our original dataframe:
new_df = df.loc[day_minimum_row_indices]
print(new_df)
patient parent csn child csn days
1 0 0 11 3
3 0 1 13 4
4 1 2 20 4
Edit (tldr):
df.loc[df.groupby("parent csn")["days"].idxmin()]
Upvotes: 7
Reputation: 5036
For some reason I can't explain your dataframe has columns of type object
. This solution only works with numerical columns
df.days = df.days.astype(int)
df.iloc[df.groupby('parent csn').days.idxmin()]
Out:
patient parent csn child csn days
1 0 0 11 3
3 0 1 13 4
4 1 2 20 4
Upvotes: 1
Reputation: 25239
As your desired ouput, you need sort_values
and groupby first
df_final = (df.sort_values(['parent csn', 'patient', 'days', 'parent csn'])
.groupby('parent csn').first())
Out[813]:
patient child csn days
parent csn
0 0 11 3
1 0 13 4
2 1 20 4
Upvotes: 4
Reputation: 16673
You can filter by the dataframe for the rows you need using groupby to create the filter rather than just using .groupby:
s = df.groupby('parent csn')['days'].transform('min') == df['days']
df = df[s]
df
Out[1]:
patient parent csn child csn days
1 0 0 11 3
3 0 1 13 4
4 1 2 20 4
For example, this is is how it would look like if I put s
in my dataframe. Then you just filter for the True
rows which are the ones where minimum days per group are equal to that row.
Out[2]:
patient parent csn child csn days s
0 0 0 10 5 False
1 0 0 11 3 True
2 0 1 12 6 False
3 0 1 13 4 True
4 1 2 20 4 True
Upvotes: 1