Pythonista
Pythonista

Reputation: 79

Groupby two column values and create a unique id

I am working on this dataset looks very similar like below where,

transaction_id   customer_id   phone           email
             1            19   12345   [email protected]          
             2            19   00001   [email protected]
             3         Guest   00001   [email protected]
             4            22   12345   [email protected]
             5            23   78900   [email protected]     

The customers under 19, Guest and 22 are actually the same, according to the similar info used in columns phone and email.

As long as, the customer ids for the customer are not unique, my goal is to find similar rows and assign a new unique customer id (to create a new unique customer_id column).

trans_id   cust_id   phone           email  unique_id
       1        19   12345   [email protected]          1        
       2        19   00001   [email protected]          1
       3     Guest   00001   [email protected]          1
       4        22   12345   [email protected]          1
       5        23   78900   [email protected]          2

The complicated side is, I can groupby email, or I can groupby email and phone. But I couldn’t grasp all rows, for example transaction number 2 is always being assigned as other unique customer id. I tried this code.

 df['unique_id'] = df.groupby(‘phone’).grouper.group_info[0] 

I greatly appreciate your time and help.

Upvotes: 1

Views: 963

Answers (1)

ALollz
ALollz

Reputation: 59519

This seems like a network problem, which can be solved with the help of networkx. We need to form the network of all cust_ids that are linked by either email or phone.

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Generate group numbers for unique phones and e-mails
df['p_gp'] = df.groupby('phone').ngroup()
df['e_gp'] = df.groupby('email').ngroup()

# This will create all pair-wise combinations customer_ids within the same `gp`
def gen_pairs(df, gp_col):
    nwk = df[['customer_id', gp_col]].merge(df[['customer_id', gp_col]], on=gp_col).drop(columns=gp_col)
    # Removes duplicates, not necessary and slow
    #nwk = nwk.assign(dummy = nwk.apply(frozenset,1)).drop_duplicates('dummy').drop(columns='dummy')
    return nwk

# All pair-wise combinations of either e-mail or phone
dfn = pd.concat([gen_pairs(df, 'p_gp'), gen_pairs(df, 'e_gp')])

# Create the graph
G = nx.from_pandas_edgelist(dfn, source='customer_id_x', target='customer_id_y')

# Visualize which users are linked:
ax,fig = plt.subplots(figsize=(4,4))
nx.draw(G, node_size=30, with_labels=True, font_size=15, edge_color='#1f77b4')
plt.draw()

enter image description here

We can get the separate groups and create a dictionary to map to a unique ID.

l = [list(x.nodes()) for x in nx.connected_component_subgraphs(G)]
#[['19', '22', 'Guest'], ['23']]

d = dict((k, i) for i in range(len(l)) for k in l[i])
#{'19': 0, '22': 0, '23': 1, 'Guest': 0}

# Finally map the customer_id with the dictionary
df['unique_id'] = df.customer_id.map(d)

  transaction_id customer_id  phone          email  p_gp  e_gp  unique_id
0              1          19  12345  [email protected]     1     0          0
1              2          19  00001  [email protected]     0     1          0
2              3       Guest  00001  [email protected]     0     0          0
3              4          22  12345  [email protected]     1     0          0
4              5          23  78900  [email protected]     2     2          1

Upvotes: 1

Related Questions