Groupby and filter based on MAX value of a one column AND CONDITIONALLY topN values of another column

Question

I have a following data:

    individual  groupID  choice     probA     probB
0      9710535        0       0  0.025589  0.008647
1      9710535        2       0  0.036252  0.014684
2      9710535        2       0  0.036252  0.013894
3      9710535        1       0  0.090057  0.030648
4      9710535        1       0  0.090057  0.014804
5      9710535        5       0  0.512675  0.021444
6      9710535        5       0  0.512675  0.020298
7      9710535        5       0  0.512675  0.163837
8      9710535        5       0  0.512675  0.085191
9      9710556        1       1  0.034381  0.796679
10     9710556        1       0  0.034381  0.796679
11     9710556        1       0  0.034381  0.796679
12     9710556        1       0  0.034381  0.157622
13     9710556        1       0  0.034381  0.157622
14     9710556        0       0  0.025589  0.008647

I want to filter data based for every individual on probA column's max value per each groupID, which is groupID = 5 for individual 9710535. However, if that group's size is less than 5, which is the case here (it has only 4 elements), I also want to have rows from other groups top(5 - that group's size) based on column probB.

End result should look like this:

  individual  groupID  choice     probA     probB
0     9710535        1       0  0.090057  0.030648
1     9710535        5       0  0.512675  0.021444
2     9710535        5       0  0.512675  0.020298
3     9710535        5       0  0.512675  0.163837
4     9710535        5       0  0.512675  0.085191
5     9710556        1       1  0.034381  0.796679
6     9710556        1       0  0.034381  0.796679
7     9710556        1       0  0.034381  0.796679
8     9710556        1       0  0.034381  0.157622
9     9710556        1       0  0.034381  0.157622

I am working on a custom function approach with this pseudo code:

def custom_filter(df, groub_by = 'individual', 
                  condition_column1 = 'probA', 
                  condition_column2 = 'probB',
                  top_count = 5)
    return filtered_df

I would appreciate any help!

jezrael · Accepted Answer

I think you need first extract all maximal rows by GroupBy.transform with max and compare groupID with filter by boolean indexing:

mask = df['groupID'].eq(df.groupby('individual')['groupID'].transform('max'))
df1 = df[mask]
print (df1)
    individual  groupID  choice     probA     probB
5      9710535        5       0  0.512675  0.021444
6      9710535        5       0  0.512675  0.020298
7      9710535        5       0  0.512675  0.163837
8      9710535        5       0  0.512675  0.085191
9      9710556        1       1  0.034381  0.796679
10     9710556        1       0  0.034381  0.796679
11     9710556        1       0  0.034381  0.796679
12     9710556        1       0  0.034381  0.157622
13     9710556        1       0  0.034381  0.157622

Then filter out non matched rows by Index.isin and sorting by 2 columns with DataFrame.sort_values:

df2 = (df[~df.index.isin(df1.index)]
           .sort_values(['individual','probB'], 
                        ascending=[True, False]))
print (df2)
    individual  groupID  choice     probA     probB
3      9710535        1       0  0.090057  0.030648
4      9710535        1       0  0.090057  0.014804
1      9710535        2       0  0.036252  0.014684
2      9710535        2       0  0.036252  0.013894
0      9710535        0       0  0.025589  0.008647
14     9710556        0       0  0.025589  0.008647

Last join together by concat, and get top 5 by GroupBy.head and last sorting by another 2 columns:

df = (pd.concat([df1, df2])
        .groupby('individual').head()
        .sort_values(['individual','groupID']))
print (df)
    individual  groupID  choice     probA     probB
3      9710535        1       0  0.090057  0.030648
5      9710535        5       0  0.512675  0.021444
6      9710535        5       0  0.512675  0.020298
7      9710535        5       0  0.512675  0.163837
8      9710535        5       0  0.512675  0.085191
9      9710556        1       1  0.034381  0.796679
10     9710556        1       0  0.034381  0.796679
11     9710556        1       0  0.034381  0.796679
12     9710556        1       0  0.034381  0.157622
13     9710556        1       0  0.034381  0.157622

Groupby and filter based on MAX value of a one column AND CONDITIONALLY topN values of another column

Answers (1)

Related Questions