Reputation: 15
I have a CSV file that looks like the table below. For each folder, I want to return the image with the highest probability of being a 'Dog'. Each folder should return exactly one image. If no Dog is present, the 'Cat' with the highest probability should be the primary image. If there is no Cat either, use the 'Bird' with the highest probability, and so on.
CSV:
FolderName ImageName Predictions Probabilities
ABC MyPet Dog 0.98
ABC HisPet Cat 0.90
DEF HerPet Bird 0.83
ABC NotPet Dog 0.23
DEF asdf Dog 0.78
DEF M123 Cat 0.19
GHI M123s Cat 0.89
GHI M13 Cat 0.19
I was only able to return the image with the highest probability. How can I prioritize the Predictions column first, and then the Probabilities column?
df.loc[df.groupby('FolderName')['Probabilities'].idxmax()]
The code returns
FolderName ImageName Predictions Probabilities
ABC MyPet Dog 0.98
DEF HerPet Bird 0.83
GHI M123s Cat 0.89
Desired result:
FolderName ImageName Predictions Probabilities
ABC MyPet Dog 0.98
DEF asdf Dog 0.78
GHI M123s Cat 0.89
Upvotes: 1
Views: 63
Reputation: 402333
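This can be done by converting "Predictions" to an ordered Categorical column, then calling sort_values and drop_duplicates. After sorting, the first row for each folder is the one with the highest-priority label and, within that label, the highest probability, so drop_duplicates keeps exactly that row.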
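import pandas as pd

# Rank the labels: Dog first, then Cat, then Bird.
df['Predictions'] = pd.Categorical(
    df['Predictions'], categories=['Dog', 'Cat', 'Bird'], ordered=True)
# Best label first, then highest probability; keep the first row per folder.
(df.sort_values(['Predictions', 'Probabilities'], ascending=[True, False])
   .drop_duplicates('FolderName'))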
FolderName ImageName Predictions Probabilities
0 ABC MyPet Dog 0.98
4 DEF asdf Dog 0.78
6 GHI M123s Cat 0.89
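For anyone who wants to try this end to end, the approach above can be sketched as a self-contained script. The sample frame is typed in from the question's table. The helper name pick_primary and the priority list are my own illustrative choices, not part of pandas. This version uses assign so the original column is left as it was, which is a small variation on the snippet above rather than a requirement.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "FolderName": ["ABC", "ABC", "DEF", "ABC", "DEF", "DEF", "GHI", "GHI"],
    "ImageName": ["MyPet", "HisPet", "HerPet", "NotPet", "asdf", "M123", "M123s", "M13"],
    "Predictions": ["Dog", "Cat", "Bird", "Dog", "Dog", "Cat", "Cat", "Cat"],
    "Probabilities": [0.98, 0.90, 0.83, 0.23, 0.78, 0.19, 0.89, 0.19],
})

# Assumed priority order, taken from the question: Dog, then Cat, then Bird.
priority = ["Dog", "Cat", "Bird"]


def pick_primary(frame, order):
    """Hypothetical helper: one row per folder, best label first, then top probability."""
    # Work on a copy so the caller's Predictions column is not converted in place.
    ranked = frame.assign(
        Predictions=pd.Categorical(frame["Predictions"], categories=order, ordered=True)
    )
    return (
        ranked.sort_values(["Predictions", "Probabilities"], ascending=[True, False])
        .drop_duplicates("FolderName")  # keeps the first (best) row per folder
        .sort_values("FolderName")
    )


result = pick_primary(df, priority)
print(result)
```

One thing to watch for: any label that is not listed in the categories becomes NaN when converted to a Categorical, and sort_values places NaN last by default, so such rows are only picked when a folder has nothing else. If that is not what you want, add every label you expect to the priority list.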
Upvotes: 1