Nabih Bawazir

Reputation: 7265

How to count duplicated values per feature (Ridit-style feature engineering) in pandas

This seems to serve multiple purposes in my machine learning project: it can count duplicated values, and it can also be used for feature extraction. Conveniently, it works for both numerical and categorical features (Ridit analysis).

My data seems to contain too much duplication, and I want to check this. Here's my data:

No   feature_1    feature_2   feature_3
1.          67           45          56 
2.          67           40          56
3.          67           40          51

Here's what I want

No   feature_1    feature_2   feature_3    duplication_1    duplication_2   duplication_3
1.          67           45          56                3                1               2
2.          67           40          56                3                2               2
3.          67           40          51                3                2               1

What I did is

# Count rows per value of each feature, then merge the counts back onto df
df1 = df.groupby('feature_1').size().reset_index(name='duplication_1')
df = df.merge(df1, on='feature_1', how='left')
df2 = df.groupby('feature_2').size().reset_index(name='duplication_2')
df = df.merge(df2, on='feature_2', how='left')
df3 = df.groupby('feature_3').size().reset_index(name='duplication_3')
df = df.merge(df3, on='feature_3', how='left')

But I am looking for a faster alternative, especially when there are many features.

Upvotes: 1

Views: 156

Answers (1)

jezrael

Reputation: 863301

Use map with value_counts, or groupby with transform, for each column:

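# assumes 'No' is the index, so df.columns holds only the feature columns
for i, x in enumerate(df.columns, start=1):
    # count how often each row's value occurs in its column
    df['duplication_{}'.format(i)] = df[x].map(df[x].value_counts())
    # alternative
    # df['duplication_{}'.format(i)] = df.groupby(x)[x].transform('size')
print(df)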
     feature_1  feature_2  feature_3  duplication_1  duplication_2  \
No                                                                   
1.0         67         45         56              3              1   
2.0         67         40         56              3              2   
3.0         67         40         51              3              2   

     duplication_3  
No                  
1.0              2  
2.0              2  
3.0              1  
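
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the sample data from the question and assumes that No is the index rather than a regular column. The names feature_cols, counts and result are only illustrative. The counts go into a separate frame that is concatenated once at the end, which may be more convenient than repeated column assignment when there are many features.

```python
import pandas as pd

# Sample data copied from the question; 'No' is assumed to be the index.
df = pd.DataFrame(
    {
        "feature_1": [67, 67, 67],
        "feature_2": [45, 40, 40],
        "feature_3": [56, 56, 51],
    },
    index=pd.Index([1, 2, 3], name="No"),
)

# Snapshot the feature names before any new columns are added.
feature_cols = list(df.columns)

# For each feature, count how many rows share that row's value.
counts = pd.DataFrame(
    {
        f"duplication_{i}": df.groupby(col)[col].transform("size")
        for i, col in enumerate(feature_cols, start=1)
    }
)

# Attach all count columns in one step.
result = pd.concat([df, counts], axis=1)
print(result)
```

If the real frame also holds non-feature columns (an ID, a target), restrict feature_cols to the columns that should be counted.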

Upvotes: 1
