Reputation: 7265
This serves multiple purposes in my machine learning project: it can count duplication, and it can also be used for feature extraction (e.g. Ridit Analysis). Conveniently, it works for both numerical and categorical features.
My data seems to have a lot of duplication, and I want to check this. Here's my data:
No  feature_1  feature_2  feature_3
1.         67         45         56
2.         67         40         56
3.         67         40         51
Here's what I want:
No  feature_1  feature_2  feature_3  duplication_1  duplication_2  duplication_3
1.         67         45         56              3              1              2
2.         67         40         56              3              2              2
3.         67         40         51              3              2              1
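For reproducibility, here is the same sample data as a DataFrame (with No as the index):

import pandas as pd

# rebuild the sample data from the tables above; 'No' is the index
df = pd.DataFrame({'feature_1': [67, 67, 67],
                   'feature_2': [45, 40, 40],
                   'feature_3': [56, 56, 51]},
                  index=pd.Index([1.0, 2.0, 3.0], name='No'))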
What I did is:
# count occurrences of each feature value, then merge the counts back
df1 = df.groupby(['feature_1']).size().reset_index()
df1.columns = ['feature_1', 'duplication_1']
df = df.merge(df1, on='feature_1', how='left')

df2 = df.groupby(['feature_2']).size().reset_index()
df2.columns = ['feature_2', 'duplication_2']
df = df.merge(df2, on='feature_2', how='left')

df3 = df.groupby(['feature_3']).size().reset_index()
df3.columns = ['feature_3', 'duplication_3']
df = df.merge(df3, on='feature_3', how='left')
But I'm looking for a better, faster alternative, especially if we have tons of features.
Upvotes: 1
Views: 156
Reputation: 863301
Use map with value_counts, or transform, for each column:
# map every value in each original column to its occurrence count
# (enumerate snapshots the columns at loop start, so the new
#  duplication_* columns are not revisited)
for i, x in enumerate(df.columns):
    df['duplication_{}'.format(i + 1)] = df[x].map(df[x].value_counts())
    # alternative: same counts via groupby + transform
    # df['duplication_{}'.format(i + 1)] = df.groupby(x)[x].transform('size')

print(df)
     feature_1  feature_2  feature_3  duplication_1  duplication_2  duplication_3
No
1.0         67         45         56              3              1              2
2.0         67         40         56              3              2              2
3.0         67         40         51              3              2              1
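If there really are tons of features, a possible variation (a sketch, not part of the answer above) is to build all the counts in one pass with concat and join them back, instead of assigning columns in a loop:

import pandas as pd

# assumes df contains only the feature columns at this point
counts = pd.concat([df[x].map(df[x].value_counts()) for x in df.columns],
                   axis=1)
counts.columns = ['duplication_{}'.format(i + 1)
                  for i in range(counts.shape[1])]
df = df.join(counts)

This joins once instead of merging once per feature, so it should scale better as the number of columns grows.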
Upvotes: 1