Reputation: 7265
Here's my data
Id feature1 feature2 feature3 feature4 feature5 feature6
1 4 5 7 7 4 5
2 5 6 8 8 5 5
What I want is for the duplicated columns to be removed:
Id feature1 feature2 feature3 feature6
1 4 5 7 5
2 5 6 8 5
It would be even better if the duplication were described as well:
feature3 is same with feature4
feature1 is same with feature5
Usually I use a seaborn correlation plot, but it gets confusing once there are more than 100 features:
import seaborn as sns
ax = sns.heatmap(df)
Upvotes: 2
Views: 46
Reputation: 323356
You can use T to transpose the dataframe and then groupby on the values. Note that drop_duplicates and duplicated will not provide the pairs: they only report which columns are duplicates, not which columns each one duplicates.
s=df.T.reset_index().groupby([0,1])['index'].apply(tuple)
s[s.str.len()>=2].apply(lambda x : '{0[0]} is same with {0[1]}'.format(x))
Out[797]:
0 1
4 5 feature1 is same with feature5
7 8 feature3 is same with feature4
Name: index, dtype: object
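The groupby([0, 1]) call above is tied to a frame with exactly two rows. As a rough sketch of the same idea for any number of rows, the column names can be grouped by their full contents. The names groups and duplicate_groups are illustrative, and the data is the sample from the question without the Id column. Columns containing NaN may not group together this way, since NaN does not compare equal to itself.

```python
import pandas as pd

df = pd.DataFrame(
    [[4, 5, 7, 7, 4, 5],
     [5, 6, 8, 8, 5, 5]],
    columns=['feature1', 'feature2', 'feature3',
             'feature4', 'feature5', 'feature6'],
)

# Map each distinct column content (as a tuple) to the columns holding it.
groups = {}
for name in df.columns:
    groups.setdefault(tuple(df[name]), []).append(name)

# Keep only the contents shared by two or more columns.
duplicate_groups = [names for names in groups.values() if len(names) > 1]

# Report every later column against the first column of its group.
for first, *rest in duplicate_groups:
    for other in rest:
        print(f'{other} is same with {first}')
```

Because each group keeps all of its members, this also covers the case where three or more columns are identical.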
Upvotes: 2
Reputation: 986
One possible solution uses the drop_duplicates() method. It compares rows, however, so you should apply it to the transposed dataframe and then transpose the result back. Example:
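import pandas as pd

data = [
    [4, 5, 7, 7, 4, 5],
    [5, 6, 8, 8, 5, 5],
]
columns = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6']
# Pass columns by keyword: the second positional argument is the index
df = pd.DataFrame(data, columns=columns)
# Transpose so columns become rows, drop duplicate rows, then transpose back
df.T.drop_duplicates().T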
To show which features are duplicated, you can use the duplicated() method:
df.T.duplicated().T
will show:
feature1 False
feature2 False
feature3 False
feature4 True
feature5 True
feature6 False
dtype: bool
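By default duplicated() flags only the later copies (feature4 and feature5 here), not the columns they repeat. A small sketch, assuming the same sample data, that uses the keep=False option of duplicated() to mark every column that belongs to a duplicate group. The names mask and involved are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[4, 5, 7, 7, 4, 5],
     [5, 6, 8, 8, 5, 5]],
    columns=['feature1', 'feature2', 'feature3',
             'feature4', 'feature5', 'feature6'],
)

# keep=False marks all members of a duplicate group, including the first one.
mask = df.T.duplicated(keep=False)

# Names of every column that has at least one identical twin.
involved = mask[mask].index.tolist()
print(involved)  # ['feature1', 'feature3', 'feature4', 'feature5']
```

This still does not say which column matches which; it only narrows down the set of columns worth comparing.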
Upvotes: 0
Reputation: 51165
You could use df.T to transpose your dataframe, use drop_duplicates, and then transpose your dataframe once more:
In [6]: df.T.drop_duplicates().T
Out[6]:
Id feature1 feature2 feature3 feature6
0 1 4 5 7 5
1 2 5 6 8 5
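One caveat with the double transpose: when the columns have mixed dtypes, transposing turns everything into object dtype, and the result stays that way. A possible variant, sketched here on the sample data with the Id column included, selects the original columns with a boolean mask instead, so the original dtypes are kept. The names keep and deduped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2],
    'feature1': [4, 5],
    'feature2': [5, 6],
    'feature3': [7, 8],
    'feature4': [7, 8],
    'feature5': [4, 5],
    'feature6': [5, 5],
})

# True for the first occurrence of each distinct column, False for repeats.
keep = ~df.T.duplicated()

# Select columns from the original frame, so dtypes are left untouched.
deduped = df.loc[:, keep]
print(deduped.columns.tolist())
```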
Upvotes: 2