Reputation: 131
Suppose I have a df:
id1 id2 id3 id4 id5
seq1 hey go what go key
seq2 done six and six six
...
I need to drop every column that contains a duplicated word in at least one row (words are only compared within a row, not across rows):
id1 id3
seq1 hey what
seq2 done and
...
Here columns id2 and id4 were dropped because of seq1, and columns id2, id4 and id5 because of seq2.
Is there any elegant way to do this?
Upvotes: 2
Views: 33
Reputation: 862406
Use boolean indexing with loc to filter the columns:
df = df.loc[:, ~df.apply(lambda x: x.duplicated(keep=False), axis=1).any()]
print (df)
id1 id3
seq1 hey what
seq2 done and
Explanation:
For each row, call the duplicated function:
print (df.apply(lambda x: x.duplicated(keep=False), axis=1))
id1 id2 id3 id4 id5
seq1 False True False True False
seq2 False True False True True
then check for at least one True per column with DataFrame.any:
print (df.apply(lambda x: x.duplicated(keep=False), axis=1).any())
id1 False
id2 True
id3 False
id4 True
id5 True
dtype: bool
Invert the boolean mask with ~:
print (~df.apply(lambda x: x.duplicated(keep=False), axis=1).any())
id1 True
id2 False
id3 True
id4 False
id5 False
dtype: bool
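Putting the steps together, here is a minimal runnable version that rebuilds the sample frame from the question and applies the mask (the index/column labels are taken from the example above):

```python
import pandas as pd

# Rebuild the sample DataFrame from the question
df = pd.DataFrame(
    {"id1": ["hey", "done"],
     "id2": ["go", "six"],
     "id3": ["what", "and"],
     "id4": ["go", "six"],
     "id5": ["key", "six"]},
    index=["seq1", "seq2"],
)

# Mark values that are duplicated within their own row,
# then keep only columns with no True in any row
mask = df.apply(lambda x: x.duplicated(keep=False), axis=1).any()
out = df.loc[:, ~mask]
print(out)
#        id1   id3
# seq1   hey  what
# seq2  done   and
```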
Upvotes: 1