Nabih Bawazir

Reputation: 7265

How to detect duplicated features in pandas

Here's my data

Id   feature1  feature2  feature3 feature4 feature5 feature6
1           4         5         7        7        4        5
2           5         6         8        8        5        5

What I want is for the duplicated features to be removed:

Id   feature1  feature2  feature3 feature6
1           4         5         7        5
2           5         6         8        5

Even better if the duplicates are reported as well:

feature3 is same with feature4
feature1 is same with feature5

Usually I use a seaborn correlation plot, but it gets confusing once there are more than 100 features:

import seaborn as sns
ax = sns.heatmap(df)

Upvotes: 2

Views: 46

Answers (3)

BENY

Reputation: 323356

You can use T and then groupby on the values. Note that drop_duplicates and duplicated will not give you the pairs: they only flag the duplicated columns, not the group each duplicate belongs to.

s=df.T.reset_index().groupby([0,1])['index'].apply(tuple)
s[s.str.len()>=2].apply(lambda  x : '{0[0]} is same with {0[1]}'.format(x))
Out[797]: 
0  1
4  5    feature1 is same with feature5
7  8    feature3 is same with feature4
Name: index, dtype: object
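The groupby([0, 1]) call above relies on the frame having exactly two rows labelled 0 and 1. As a rough sketch of the same grouping idea for any number of rows, the columns can be keyed by a tuple of their values. The helper name duplicate_groups is made up for illustration, and the sample frame is rebuilt from the question's data:

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame(
    {
        "Id": [1, 2],
        "feature1": [4, 5],
        "feature2": [5, 6],
        "feature3": [7, 8],
        "feature4": [7, 8],
        "feature5": [4, 5],
        "feature6": [5, 5],
    }
)


def duplicate_groups(frame):
    """Hypothetical helper: group column names whose values are identical."""
    groups = {}
    for name in frame.columns:
        # A tuple of the column's values is hashable, so it can act as a dict key
        key = tuple(frame[name].tolist())
        groups.setdefault(key, []).append(name)
    # Keep only the groups that contain more than one column
    return [names for names in groups.values() if len(names) > 1]


groups = duplicate_groups(df)
for first, *rest in groups:
    for other in rest:
        print(f"{other} is same with {first}")
```

This reports every member of a group, not just the first pair. One caveat: columns that contain NaN will probably not be grouped together this way, because NaN values do not compare equal to each other.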

Upvotes: 2

Olzhas Arystanov

Reputation: 986

One possible solution uses the drop_duplicates() method. It compares rows, though, so you need to apply it to the transposed dataframe and then transpose the result back. Example:

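import pandas as pd

data = [
    [4, 5, 7, 7, 4, 5],
    [5, 6, 8, 8, 5, 5],
]

columns = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6']

# Pass the names as column labels; the second positional argument is the index
df = pd.DataFrame(data, columns=columns)

# drop_duplicates() compares rows, so transpose, drop, and transpose back
df.T.drop_duplicates().T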

To show which features are duplicated, you can use the duplicated() method:

df.T.duplicated()

will show:

feature1    False
feature2    False
feature3    False
feature4     True
feature5     True
feature6    False
dtype: bool
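By default duplicated() marks only the later copies, so feature3 and feature1 (the originals) show as False. If you want every column that has a twin, a small sketch using keep=False could look like this (same sample data as above, variable names are only illustrative):

```python
import pandas as pd

data = [
    [4, 5, 7, 7, 4, 5],
    [5, 6, 8, 8, 5, 5],
]
columns = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6']
df = pd.DataFrame(data, columns=columns)

# keep=False marks every member of a duplicate group, not just the later copies
mask_all = df.T.duplicated(keep=False)

# Names of all columns that have at least one identical twin
dup_cols = mask_all[mask_all].index.tolist()
print(dup_cols)
```

This still does not say which column matches which, only that each listed column has a duplicate somewhere.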

Upvotes: 0

user3483203

Reputation: 51165

You could use df.T to transpose your dataframe, use drop_duplicates, and then transpose your dataframe once more:

In [6]: df.T.drop_duplicates().T
Out[6]:
   Id  feature1  feature2  feature3  feature6
0   1         4         5         7         5
1   2         5         6         8         5
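A possible variation, in case the columns have mixed dtypes: transposing twice can turn everything into object dtype, so it may be safer to build a boolean mask from the transposed frame and select columns from the original. A minimal sketch, assuming the frame from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Id": [1, 2],
        "feature1": [4, 5],
        "feature2": [5, 6],
        "feature3": [7, 8],
        "feature4": [7, 8],
        "feature5": [4, 5],
        "feature6": [5, 5],
    }
)

# True for each column that repeats an earlier column
is_dup = df.T.duplicated()

# Select the non-duplicated columns from the original frame,
# so the original dtypes are left untouched
result = df.loc[:, ~is_dup]
print(result)
```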

Upvotes: 2
