Reputation: 252
I have the following data:
individual groupID choice probA probB
0 9710535 0 0 0.025589 0.008647
1 9710535 2 0 0.036252 0.014684
2 9710535 2 0 0.036252 0.013894
3 9710535 1 0 0.090057 0.030648
4 9710535 1 0 0.090057 0.014804
5 9710535 5 0 0.512675 0.021444
6 9710535 5 0 0.512675 0.020298
7 9710535 5 0 0.512675 0.163837
8 9710535 5 0 0.512675 0.085191
9 9710556 1 1 0.034381 0.796679
10 9710556 1 0 0.034381 0.796679
11 9710556 1 0 0.034381 0.796679
12 9710556 1 0 0.034381 0.157622
13 9710556 1 0 0.034381 0.157622
14 9710556 0 0 0.025589 0.008647
For every individual, I want to filter the data on the probA column's maximum value per groupID, which is groupID = 5 for individual 9710535. However, if that group's size is less than 5, which is the case here (it has only 4 rows), I also want to include the top (5 - that group's size) rows from the other groups, based on column probB.
The end result should look like this:
individual groupID choice probA probB
0 9710535 1 0 0.090057 0.030648
1 9710535 5 0 0.512675 0.021444
2 9710535 5 0 0.512675 0.020298
3 9710535 5 0 0.512675 0.163837
4 9710535 5 0 0.512675 0.085191
5 9710556 1 1 0.034381 0.796679
6 9710556 1 0 0.034381 0.796679
7 9710556 1 0 0.034381 0.796679
8 9710556 1 0 0.034381 0.157622
9 9710556 1 0 0.034381 0.157622
I am working on a custom function approach with this pseudo code:
def custom_filter(df, group_by='individual',
                  condition_column1='probA',
                  condition_column2='probB',
                  top_count=5):
    # filtering logic to be implemented
    return filtered_df
I would appreciate any help!
Upvotes: 1
Views: 607
Reputation: 862911
I think you need to first extract all maximal rows by GroupBy.transform with max, compare with groupID, and filter by boolean indexing:
# rows where groupID equals the per-individual maximum
mask = df['groupID'].eq(df.groupby('individual')['groupID'].transform('max'))
df1 = df[mask]
print (df1)
individual groupID choice probA probB
5 9710535 5 0 0.512675 0.021444
6 9710535 5 0 0.512675 0.020298
7 9710535 5 0 0.512675 0.163837
8 9710535 5 0 0.512675 0.085191
9 9710556 1 1 0.034381 0.796679
10 9710556 1 0 0.034381 0.796679
11 9710556 1 0 0.034381 0.796679
12 9710556 1 0 0.034381 0.157622
13 9710556 1 0 0.034381 0.157622
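As a side note, the question describes the target rows via the maximum of probA rather than the maximum groupID. Since probA is constant within each groupID in the sample, an equivalent mask can be built on that column instead (a sketch, assuming that property holds in the full data):
# alternative: select rows where probA is at its per-individual maximum
mask = df['probA'].eq(df.groupby('individual')['probA'].transform('max'))
df1 = df[mask]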
Then filter out the non-matched rows by Index.isin and sort by 2 columns with DataFrame.sort_values:
# remaining rows, sorted by probB descending within each individual
df2 = (df[~df.index.isin(df1.index)]
         .sort_values(['individual','probB'],
                      ascending=[True, False]))
print (df2)
individual groupID choice probA probB
3 9710535 1 0 0.090057 0.030648
4 9710535 1 0 0.090057 0.014804
1 9710535 2 0 0.036252 0.014684
2 9710535 2 0 0.036252 0.013894
0 9710535 0 0 0.025589 0.008647
14 9710556 0 0 0.025589 0.008647
Last, join them together by concat, get the top 5 rows per individual by GroupBy.head, and finally sort by another 2 columns. Because df1 comes first in the concat, head(5) keeps all of its rows and tops up the remainder from the probB-sorted df2:
# df1 rows come first in the concat, so head(5) keeps them and fills up from df2
df = (pd.concat([df1, df2])
        .groupby('individual').head(5)
        .sort_values(['individual','groupID']))
print (df)
individual groupID choice probA probB
3 9710535 1 0 0.090057 0.030648
5 9710535 5 0 0.512675 0.021444
6 9710535 5 0 0.512675 0.020298
7 9710535 5 0 0.512675 0.163837
8 9710535 5 0 0.512675 0.085191
9 9710556 1 1 0.034381 0.796679
10 9710556 1 0 0.034381 0.796679
11 9710556 1 0 0.034381 0.796679
12 9710556 1 0 0.034381 0.157622
13 9710556 1 0 0.034381 0.157622
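Putting the three steps together into the custom_filter signature from the question (parameter names taken from the pseudocode there; this is a sketch of one possible wrapper, and it assumes condition_column1 marks the target group via its maximum, as the question describes):
import pandas as pd

def custom_filter(df, group_by='individual',
                  condition_column1='probA',
                  condition_column2='probB',
                  top_count=5):
    # step 1: rows where condition_column1 hits its per-group maximum
    mask = df[condition_column1].eq(
        df.groupby(group_by)[condition_column1].transform('max'))
    df1 = df[mask]
    # step 2: remaining rows, best condition_column2 first within each group
    df2 = (df[~df.index.isin(df1.index)]
           .sort_values([group_by, condition_column2],
                        ascending=[True, False]))
    # step 3: keep at most top_count rows per group; df1 rows take priority
    return (pd.concat([df1, df2])
            .groupby(group_by).head(top_count)
            .sort_values([group_by, 'groupID']))
With the sample data, custom_filter(df) reproduces the 10 expected rows. If the intent is strictly the group with the highest groupID, swap the mask for the groupID-based one from the first step.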
Upvotes: 2