Reputation: 131
Suppose I have a df:
id1 id2 id3 id4 id5
seq1 hey go what go key
seq2 done six and six six
...
I need to drop every column that contains a duplicated word in at least one row (words are only compared within a row, not across rows):
id1 id3
seq1 hey what
seq2 done and
...
Here columns id2 and id4 were dropped because of seq1, and columns id2, id4 and id5 because of seq2.
Is there any elegant way to do this?
Upvotes: 2
Views: 33
Reputation: 862406
Use boolean indexing with loc to filter the columns:
df = df.loc[:, ~df.apply(lambda x: x.duplicated(keep=False), axis=1).any()]
print (df)
id1 id3
seq1 hey what
seq2 done and
Explanation:
For each row, call the duplicated function:
print (df.apply(lambda x: x.duplicated(keep=False), axis=1))
id1 id2 id3 id4 id5
seq1 False True False True False
seq2 False True False True True
then check for at least one True per column with DataFrame.any:
print (df.apply(lambda x: x.duplicated(keep=False), axis=1).any())
id1 False
id2 True
id3 False
id4 True
id5 True
dtype: bool
Invert the boolean mask with ~:
print (~df.apply(lambda x: x.duplicated(keep=False), axis=1).any())
id1 True
id2 False
id3 True
id4 False
id5 False
dtype: bool
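Putting the steps together, here is a minimal runnable version that rebuilds the sample frame from the question and applies the mask (the index/column labels are taken from the example above):

```python
import pandas as pd

# Rebuild the sample DataFrame from the question
df = pd.DataFrame(
    {"id1": ["hey", "done"],
     "id2": ["go", "six"],
     "id3": ["what", "and"],
     "id4": ["go", "six"],
     "id5": ["key", "six"]},
    index=["seq1", "seq2"],
)

# Mark values that are duplicated within their own row,
# then keep only columns with no True in any row
mask = df.apply(lambda x: x.duplicated(keep=False), axis=1).any()
out = df.loc[:, ~mask]
print(out)
#        id1   id3
# seq1   hey  what
# seq2  done   and
```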
Upvotes: 1