Reputation: 79
I am working on this dataset looks very similar like below where,
transaction_id customer_id phone email
1 19 12345 [email protected]
2 19 00001 [email protected]
3 Guest 00001 [email protected]
4 22 12345 [email protected]
5 23 78900 [email protected]
The customers under 19, Guest and 22 are actually the same, according to the similar info used in columns phone and email.
As long as, the customer ids for the customer are not unique, my goal is to find similar rows and assign a new unique customer id (to create a new unique customer_id column).
trans_id cust_id phone email unique_id
1 19 12345 [email protected] 1
2 19 00001 [email protected] 1
3 Guest 00001 [email protected] 1
4 22 12345 [email protected] 1
5 23 78900 [email protected] 2
The complicated side is, I can groupby email, or I can groupby email and phone. But I couldn’t grasp all rows, for example transaction number 2 is always being assigned as other unique customer id. I tried this code.
df['unique_id'] = df.groupby(‘phone’).grouper.group_info[0]
I greatly appreciate your time and help.
Upvotes: 1
Views: 963
Reputation: 59519
This seems like a network problem, which can be solved with the help of networkx
. We need to form the network of all cust_id
s that are linked by either email or phone.
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
# Generate group numbers for unique phones and e-mails
df['p_gp'] = df.groupby('phone').ngroup()
df['e_gp'] = df.groupby('email').ngroup()
# This will create all pair-wise combinations customer_ids within the same `gp`
def gen_pairs(df, gp_col):
nwk = df[['customer_id', gp_col]].merge(df[['customer_id', gp_col]], on=gp_col).drop(columns=gp_col)
# Removes duplicates, not necessary and slow
#nwk = nwk.assign(dummy = nwk.apply(frozenset,1)).drop_duplicates('dummy').drop(columns='dummy')
return nwk
# All pair-wise combinations of either e-mail or phone
dfn = pd.concat([gen_pairs(df, 'p_gp'), gen_pairs(df, 'e_gp')])
# Create the graph
G = nx.from_pandas_edgelist(dfn, source='customer_id_x', target='customer_id_y')
# Visualize which users are linked:
ax,fig = plt.subplots(figsize=(4,4))
nx.draw(G, node_size=30, with_labels=True, font_size=15, edge_color='#1f77b4')
plt.draw()
We can get the separate groups and create a dictionary to map to a unique ID.
l = [list(x.nodes()) for x in nx.connected_component_subgraphs(G)]
#[['19', '22', 'Guest'], ['23']]
d = dict((k, i) for i in range(len(l)) for k in l[i])
#{'19': 0, '22': 0, '23': 1, 'Guest': 0}
# Finally map the customer_id with the dictionary
df['unique_id'] = df.customer_id.map(d)
transaction_id customer_id phone email p_gp e_gp unique_id
0 1 19 12345 [email protected] 1 0 0
1 2 19 00001 [email protected] 0 1 0
2 3 Guest 00001 [email protected] 0 0 0
3 4 22 12345 [email protected] 1 0 0
4 5 23 78900 [email protected] 2 2 1
Upvotes: 1