Reputation: 93
I have a data frame with text in one column and its label in another. Some texts appear more than once, each time with a different label. I want to remove these duplicates and keep only the record with a specified label.
Sample dataframe:
text label
0 great view a
1 great view b
2 good balcony a
3 nice service a
4 nice service b
5 nice service c
6 bad rooms f
7 nice restaurant a
8 nice restaurant d
9 nice beach nearby x
10 good casino z
Now suppose I want to keep the row with label a wherever it is present and remove only the other duplicates. Sample output:
text label
0 great view a
1 good balcony a
2 nice service a
3 bad rooms f
4 nice restaurant a
5 nice beach nearby x
6 good casino z
Thanks in advance!
Upvotes: 2
Views: 64
Reputation: 323226
You can simply call sort_values before drop_duplicates. The df is first ordered by label alphabetically ('a' < 'b' evaluates to True), so within each text the row with label a comes first and is the one drop_duplicates keeps.
df=df.sort_values('label').drop_duplicates('text')
Or
df=df.sort_values('label').groupby('text').head(1)
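Update: to keep a specific label rather than the alphabetically first one, move the rows with that label to the front before dropping duplicates (a stable sort keeps the original order among the remaining rows)
Valuetokeep='a'
df=df.iloc[(df.label!=Valuetokeep).argsort(kind='mergesort')].drop_duplicates('text')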
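For reference, here is a minimal runnable sketch of the approach above, applied to the sample data from the question. The names value_to_keep, order and result are only illustrative. The trailing sort_index / reset_index step is an addition, not part of the original one-liner, assumed here so that the row order matches the sample output.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "text": [
        "great view", "great view", "good balcony",
        "nice service", "nice service", "nice service",
        "bad rooms", "nice restaurant", "nice restaurant",
        "nice beach nearby", "good casino",
    ],
    "label": ["a", "b", "a", "a", "b", "c", "f", "a", "d", "x", "z"],
})

value_to_keep = "a"  # illustrative name for the label to prefer

# False (label matches) sorts before True, so matching rows come first.
# A stable sort preserves the original order within each group.
order = (df["label"] != value_to_keep).to_numpy().argsort(kind="stable")

result = (
    df.iloc[order]
      .drop_duplicates("text")   # keeps the first row seen for each text
      .sort_index()              # assumption: restore the original row order
      .reset_index(drop=True)    # assumption: renumber as in the sample output
)
print(result)
```

Texts that never carry the preferred label (for example "bad rooms") keep their first row in the original order, so nothing is lost.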
Upvotes: 1