Reputation: 273
Each row of my data frame contains a list, and I would like to drop the rows whose list is a duplicate, keeping the row with the highest score.
Here is my data, from data frame df1:
pair score
0 [A , A ] 1.0000
1 [A , F ] 0.9990
2 [A , G ] 0.9985
3 [A , G ] 0.9975
4 [A , H ] 0.9985
5 [A , H ] 0.9990
I would like the result to be:
pair score
0 [A , A ] 1.0000
1 [A , F ] 0.9990
2 [A , G ] 0.9985
4 [A , H ] 0.9990
I have tried using groupby and taking the max of score, but it's not working.
Upvotes: 3
Views: 1598
Reputation: 8273
Make a new column pair2 with the sorted values converted to string type, and then drop duplicates.
This also handles the case where pairs have values such as [A,G] and [G,A], treating them as the same.
df['pair2']=df.pair.map(sorted).astype(str)
df.sort_values('score',ascending=False).drop_duplicates('pair2',keep='first').drop('pair2',axis=1).reset_index(drop=True)
Output:
pair score
[A, A] 1.0000
[A, F] 0.9990
[A, H] 0.9990
[A, G] 0.9985
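For reference, here is a self-contained sketch of the same idea that can be run as-is. It rebuilds the question's data under the assumption that each pair is a list of two plain strings, and it uses a sorted tuple as the key instead of a string; the names key and best are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (assumption: pairs are lists of plain strings).
df1 = pd.DataFrame({
    "pair": [["A", "A"], ["A", "F"], ["A", "G"], ["A", "G"], ["A", "H"], ["A", "H"]],
    "score": [1.0000, 0.9990, 0.9985, 0.9975, 0.9985, 0.9990],
})

# Hashable, order-insensitive key: ["G", "A"] and ["A", "G"] both become ("A", "G").
key = df1["pair"].map(lambda p: tuple(sorted(p)))

# Sort by score so the best row of each key comes first, then keep that first row.
best = (
    df1.assign(key=key)
       .sort_values("score", ascending=False)
       .drop_duplicates("key", keep="first")
       .drop(columns="key")
       .sort_index()  # restore the original row order
)
print(best)
```

The final sort_index() is optional; it only puts the surviving rows back in their original order instead of score order.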
Upvotes: 0
Reputation: 862581
First, I think working with lists in pandas is not a good idea.
A solution is to convert the lists to a helper column of tuples, then use sort_values with drop_duplicates:
df['new'] = df.pair.apply(tuple)
df = df.sort_values('score', ascending=False).drop_duplicates('new')
print (df)
pair score new
0 [A, A] 1.0000 (A, A)
1 [A, F] 0.9990 (A, F)
5 [A, H] 0.9990 (A, H)
2 [A, G] 0.9985 (A, G)
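Or convert the lists to 2 new columns:
# pass index=df.index so the new columns align with df even if its index is not 0..n-1
df[['a', 'b']] = pd.DataFrame(df.pair.values.tolist(), index=df.index)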
df = df.sort_values('score', ascending=False).drop_duplicates(['a', 'b'])
print (df)
pair score a b
0 [A, A] 1.0000 A A
1 [A, F] 0.9990 A F
5 [A, H] 0.9990 A H
2 [A, G] 0.9985 A G
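Since the question mentions trying groupby, here is a possible groupby variant of the two-column approach, as a sketch only. It assumes the question's data with pairs stored as lists of two plain strings; parts, keep and result are illustrative names. Like the code above, it treats [A, G] and [G, A] as different pairs.

```python
import pandas as pd

# Sample data rebuilt from the question (assumption: pairs are lists of plain strings).
df1 = pd.DataFrame({
    "pair": [["A", "A"], ["A", "F"], ["A", "G"], ["A", "G"], ["A", "H"], ["A", "H"]],
    "score": [1.0000, 0.9990, 0.9985, 0.9975, 0.9985, 0.9990],
})

# Lists are not hashable, so split each pair into two ordinary columns to group on.
parts = pd.DataFrame(df1["pair"].tolist(), index=df1.index, columns=["a", "b"])

# For each (a, b) group, idxmax gives the index label of the row with the highest score.
keep = df1["score"].groupby([parts["a"], parts["b"]]).idxmax()

# Select those rows, in their original order.
result = df1.loc[sorted(keep)]
print(result)
```

This keeps the original pair column untouched and adds no helper columns to df1.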
Upvotes: 1