Reputation: 19
This is a follow-on to this question about merging two files with protein data.
When I import dataframes with the biopandas package, I'm unable to get duplicated/drop_duplicates to drop my duplicates. My dataframe is quite big:
# df:
col1 col2 col3 col4 col5 col6 col7 col8 col9
0 ATOM N SER 15 17.203 0.286 72.985 4pxz
1 ATOM CA SER 15 16.713 1.342 73.869 4pxz
2 ATOM C SER 15 17.885 2.188 74.412 4pxz
3 ATOM O SER 15 18.028 3.351 74.013 4pxz
4 ATOM CB SER 15 15.889 0.750 75.014 4pxz
... ... ... ... ... ... ... ... ...
3 ATOM CD ARG 93 12.319 8.102 61.886 hatp
4 ATOM NE ARG 93 11.978 6.754 61.425 hatp
5 ATOM CZ ARG 93 11.731 5.714 62.217 hatp
6 ATOM NH2 ARG 93 11.430 4.535 61.694 hatp
7 ATOM NH1 ARG 93 11.793 5.843 63.538 hatp
3148 rows × 8 columns
I wanted to check it for duplicates using the following:
df2 = df[df.duplicated(['col3','col4','col5'])] # show me duplicates containing identical type(col3), abbreviation(col4) and number(col5).
And I got:
col1 col2 col3 col4 col5 col6 col7 col8
2132 ATOM CA HIS 1063 38.442 -16.479 -5.209 4pxz
2136 ATOM CB HIS 1063 37.502 -15.555 -6.008 4pxz
2138 ATOM CG HIS 1063 38.007 -15.211 -7.378 4pxz
2140 ATOM ND1 HIS 1063 38.342 -16.194 -8.293 4pxz
2142 ATOM CD2 HIS 1063 38.213 -14.000 -7.943 4pxz
2144 ATOM CE1 HIS 1063 38.749 -15.553 -9.375 4pxz
2146 ATOM NE2 HIS 1063 38.688 -14.231 -9.213 4pxz
0 ATOM CA ARG 93 11.357 9.429 58.493 hatp
1 ATOM CB ARG 93 12.236 9.564 59.757 hatp
2 ATOM CG ARG 93 11.569 9.166 61.087 hatp
3 ATOM CD ARG 93 12.319 8.102 61.886 hatp
4 ATOM NE ARG 93 11.978 6.754 61.425 hatp
5 ATOM CZ ARG 93 11.731 5.714 62.217 hatp
6 ATOM NH2 ARG 93 11.430 4.535 61.694 hatp
7 ATOM NH1 ARG 93 11.793 5.843 63.538 hatp
Expected output:
col1 col2 col3 col4 col5 col6 col7 col8 col9
606 ATOM CA ARG 93 11.357 9.429 58.493 4pxz
609 ATOM CB ARG 93 12.236 9.564 59.757 4pxz
610 ATOM CG ARG 93 13.088 8.333 60.120 4pxz
611 ATOM CD ARG 93 13.985 7.822 58.995 4pxz
612 ATOM NE ARG 93 14.503 6.485 59.295 4pxz
613 ATOM CZ ARG 93 15.012 5.642 58.400 4pxz
614 ATOM NH1 ARG 93 15.074 5.979 57.116 4pxz
615 ATOM NH2 ARG 93 15.455 4.453 58.780 4pxz
0 ATOM CA ARG 93 11.357 9.429 58.493 hatp
1 ATOM CB ARG 93 12.236 9.564 59.757 hatp
2 ATOM CG ARG 93 11.569 9.166 61.087 hatp
3 ATOM CD ARG 93 12.319 8.102 61.886 hatp
4 ATOM NE ARG 93 11.978 6.754 61.425 hatp
5 ATOM CZ ARG 93 11.731 5.714 62.217 hatp
6 ATOM NH2 ARG 93 11.430 4.535 61.694 hatp
7 ATOM NH1 ARG 93 11.793 5.843 63.538 hatp
As you can see, the result of duplicated() does not match what I expected (drop_duplicates behaves the same way). To get the rows I wanted, I had to use:
df2 = df[df['col5'] == 93]
What is wrong?
Upvotes: 0
Views: 78
Reputation: 638
Isn’t the command df.duplicated? Also make sure to pass the option keep=False.
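To illustrate why keep=False matters: by default duplicated() uses keep='first', which leaves the first occurrence of each group unmarked, so only the later copies show up. Here is a minimal sketch on made-up toy data (the values and the "source" column name are my own assumptions, not your real file):

```python
import pandas as pd

# Hypothetical toy data: the same residue atoms appear in two structures.
df = pd.DataFrame({
    "col3": ["CA", "CB", "CA", "CB", "N"],
    "col4": ["ARG", "ARG", "ARG", "ARG", "SER"],
    "col5": [93, 93, 93, 93, 15],
    "source": ["4pxz", "4pxz", "hatp", "hatp", "4pxz"],  # assumed column name
})
key = ["col3", "col4", "col5"]

# Default keep='first': the first occurrence of each key is NOT flagged,
# so only the later copies (the hatp rows here) are selected.
later_only = df[df.duplicated(subset=key)]

# keep=False: every row whose key occurs more than once is flagged,
# so the copies from both structures are selected.
all_copies = df[df.duplicated(subset=key, keep=False)]

print(later_only)
print(all_copies)
```

In this toy case later_only holds only the two hatp rows, while all_copies holds the ARG 93 rows from both structures, which looks like the shape of output you expected.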
Upvotes: 1
Reputation: 19
The proper answer:
df2 = df[df.duplicated(subset = ['col3','col4','col5'], keep = False)]
Thank you very much, guys!
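For anyone landing here for the drop_duplicates side of the question: the keep argument works the same way there, but the selection is inverted. A small sketch, again on made-up toy data (values and the "source" column name are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data: the same residue atoms appear in two structures.
df = pd.DataFrame({
    "col3": ["CA", "CB", "CA", "CB", "N"],
    "col4": ["ARG", "ARG", "ARG", "ARG", "SER"],
    "col5": [93, 93, 93, 93, 15],
    "source": ["4pxz", "4pxz", "hatp", "hatp", "4pxz"],  # assumed column name
})
key = ["col3", "col4", "col5"]

# Default keep='first': one row per key survives (the first one seen).
first_kept = df.drop_duplicates(subset=key)

# keep=False: every row whose key is repeated is removed,
# leaving only the rows that are unique on the key columns.
unique_only = df.drop_duplicates(subset=key, keep=False)

print(first_kept)
print(unique_only)
```

So drop_duplicates(keep=False) returns exactly the rows that df.duplicated(keep=False) does not flag.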
Upvotes: 0