Reputation: 47
How can I get the ImgFileNames whose HashCodes occur more than once, in Python? Note: I want to retain only the first occurrence of each duplicated hash and drop the remaining ones, wherever in the data frame they occur.
I have a data frame like the one below:
ImgFileName HashCodes
Img_0001 - Copy.tif 162a47470f021a60
Img_0001.tif 162a47470f021a60
Img_0002.tif 1b5b5b1aa638dac8
Img_0003.tif adadadadadadadad
Img_0004.tif adadadadadadadad
Img_0005 - Copy.tif a5b8648c8c666670
Img_0005.tif a5b8648c8c666670
Img_0006.tif 71b392da6a699392
Img_0007.tif 71b392da6a699392
Img_0008.tif b1b1f2fa6bf97292
Img_0009.tif 86e82ae4c8b6c9c9
Img_0010 - Copy.tif 86e8aae4c8b6c9c9
Img_0010.tif 86e8aae4c8b6c9c9
I want the output to look like this:
ImgFileName HashCodes
Img_0001 - Copy.tif 162a47470f021a60
Img_0003.tif adadadadadadadad
Img_0005 - Copy.tif a5b8648c8c666670
Img_0006.tif 71b392da6a699392
Img_0009.tif 86e82ae4c8b6c9c9
Upvotes: 1
Views: 145
Reputation: 863701
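You need boolean indexing with duplicated. The first mask (keep=False) flags every row whose hash occurs more than once; the second mask narrows that down to either the later occurrences (the default, keep='first', flags everything except the first occurrence) or the earlier occurrences (keep='last' flags everything except the last occurrence):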
df = df[df.duplicated('HashCodes', keep=False) & df.duplicated('HashCodes')]
print(df)
ImgFileName HashCodes
1 Img_0001.tif 162a47470f021a60
4 Img_0004.tif adadadadadadadad
6 Img_0005.tif a5b8648c8c666670
8 Img_0007.tif 71b392da6a699392
12 Img_0010.tif 86e8aae4c8b6c9c9
Or, starting again from the original df, to get the earlier occurrence of each dupe instead:
df = df[df.duplicated('HashCodes', keep=False) & df.duplicated('HashCodes', keep='last')]
print(df)
ImgFileName HashCodes
0 Img_0001 - Copy.tif 162a47470f021a60
3 Img_0003.tif adadadadadadadad
5 Img_0005 - Copy.tif a5b8648c8c666670
7 Img_0006.tif 71b392da6a699392
11 Img_0010 - Copy.tif 86e8aae4c8b6c9c9
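Note that the keep='last' variant returns every occurrence except the last one, so it yields exactly one row per hash only when each duplicated hash appears twice, as it does in this data. Below is a minimal sketch of a variant that keeps just the first occurrence regardless of group size. It assumes the frame from the question, rebuilt here by hand; the variable names and the small `triple` frame are hypothetical and only for illustration.

```python
import pandas as pd

# Same data as in the question, typed in by hand.
df = pd.DataFrame({
    "ImgFileName": [
        "Img_0001 - Copy.tif", "Img_0001.tif", "Img_0002.tif",
        "Img_0003.tif", "Img_0004.tif", "Img_0005 - Copy.tif",
        "Img_0005.tif", "Img_0006.tif", "Img_0007.tif", "Img_0008.tif",
        "Img_0009.tif", "Img_0010 - Copy.tif", "Img_0010.tif",
    ],
    "HashCodes": [
        "162a47470f021a60", "162a47470f021a60", "1b5b5b1aa638dac8",
        "adadadadadadadad", "adadadadadadadad", "a5b8648c8c666670",
        "a5b8648c8c666670", "71b392da6a699392", "71b392da6a699392",
        "b1b1f2fa6bf97292", "86e82ae4c8b6c9c9", "86e8aae4c8b6c9c9",
        "86e8aae4c8b6c9c9",
    ],
})

# True for every row whose hash occurs more than once.
in_dupe_group = df.duplicated("HashCodes", keep=False)
# True for every occurrence after the first one.
is_repeat = df.duplicated("HashCodes")

# First occurrence of each duplicated hash: in a dupe group, but not a repeat.
first_of_dupes = df[in_dupe_group & ~is_repeat]
print(first_of_dupes)

# Hypothetical frame where one hash occurs three times.
triple = pd.DataFrame({
    "ImgFileName": ["a.tif", "b.tif", "c.tif"],
    "HashCodes": ["ff", "ff", "ff"],
})
# keep='last' flags all but the last occurrence, so two rows come back.
keep_last_variant = triple[
    triple.duplicated("HashCodes", keep=False)
    & triple.duplicated("HashCodes", keep="last")
]
# Negating the default mask leaves only the first occurrence.
first_only = triple[
    triple.duplicated("HashCodes", keep=False)
    & ~triple.duplicated("HashCodes")
]
print(len(keep_last_variant), len(first_only))
```

On the question's data this selects the same rows as the keep='last' version above (index 0, 3, 5, 7, 11). The two approaches only diverge when a hash appears three or more times.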
Upvotes: 1