Find most common email patterns based on email addresses and names in Python

Question

Given a dataframe as follows:

   firstname    lastname                             email  \
0     Kieron      Futter  kieron.futter@ascendishealth.com   
1    Vinsonn         Law        vinsonn.law@carestream.com   
2      Rayan  Vanderhoof      rayan.vanderhoof@olympus.com   
3       Andy      Joiner                    andy@tepha.com   
4  Christine     Nichols             cnichols@prosetta.com   
5         Bo       Smith            bsmith@innoviveinc.com   
6    Rebecca        Ford   rford@catholiccharitiesswks.org   
7     Fatima      Sheikh    fatima@broomestreetsociety.com   
8       Zack     Scriven    zack.scriven@soffaelectric.com   
9       Bara     Alomari                 baraa@playhut.com   

                 companyname  
0         ascendishealth.com  
1                 Carestream  
2       Olympus America Inc.  
3                 Tepha Inc.  
4             Prosetta Corp.  
5             Innovive, Inc.  
6  catholiccharitiesswks.org  
7        broomestreetsociety  
8              soffaelectric  
9                playhut.com

How could I find the top 3 most common email patterns (first@example.com, firstlast@example.com, first.last@example.com, last@example.com,first@example.com, f.last@example.com, lastF@example.com, first_last@example.com, firstL@example.com) by comparing the value in the email column to firstname and lastname columns?

I have used df['name_email'] = df.email.str.split('@', expand = True)[0] to extract names from email address.

Out:

0        douglas.watson
1         nick.holekamp
2         rob.schriener
3       austin.phillips
4               egeiger
             ...       
995             thanley
996              cmarks
997      darryl.rickner
998               lalit
999          parul.dutt

Thanks.

EDIT:

The error raised from @Stef's code:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/site-packages/pandas/core/series.py in _try_kind_sort(arr)
   2947                 # if kind==mergesort, it can fail for object dtype
-> 2948                 return arr.argsort(kind=kind)
   2949             except TypeError:

TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
 in 
      7 df['f.last'] = df.firstname.str.lower()[0] + '.' + df.lastname.str.lower() == df.name_email
      8 
----> 9 print(df.iloc[:,4:].sum().sort_values(ascending=False))

/usr/local/lib/python3.7/site-packages/pandas/core/series.py in sort_values(self, axis, ascending, inplace, kind, na_position, ignore_index)
   2960         idx = ibase.default_index(len(self))
   2961 
-> 2962         argsorted = _try_kind_sort(arr[good])
   2963 
   2964         if is_list_like(ascending):

/usr/local/lib/python3.7/site-packages/pandas/core/series.py in _try_kind_sort(arr)
   2950                 # stable sort not available for object dtype
   2951                 # uses the argsort default quicksort
-> 2952                 return arr.argsort(kind="quicksort")
   2953 
   2954         arr = self._values

TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'

The result of df.iloc[:,4:].info()

Out:


Int64Index: 9937 entries, 0 to 9999
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   companyname     9937 non-null   object
 1   industry        9937 non-null   object
 2   level           9624 non-null   object
 3   primarydomain   9937 non-null   object
 4   twitterid       225 non-null    object
 5   facebookid      11 non-null     object
 6   linkedinid      2564 non-null   object
 7   industry.1      9937 non-null   object
 8   companysize     7538 non-null   object
 9   companyrevenue  7596 non-null   object
 10  city            8773 non-null   object
 11  state           7936 non-null   object
 12  dept            8865 non-null   object
 13  phonenumber     41 non-null     object
 14  net_name        9937 non-null   object
 15  domain_name     9937 non-null   object
 16  email           9937 non-null   object
 17  name_email1     9937 non-null   object
 18  name_email      9937 non-null   object
 19  first           9937 non-null   bool  
 20  firstlast       9937 non-null   bool  
 21  first.last      9937 non-null   bool  
 22  last            9937 non-null   bool  
 23  f.last          9937 non-null   bool  
dtypes: bool(5), object(19)
memory usage: 1.9+ MB

Stef · Accepted Answer

You could add columns for all possible combinations and then count the hits:

import pandas as pd

df = pd.DataFrame({ 'firstname': ['Kieron', 'Vinsonn', 'Rayan', 'Andy', 'Christine', 'Bo', 'Rebecca', 'Fatima', 'Zack', 'Bara'],
                    'lastname': ['Futter', 'Law', 'Vanderhoof', 'Joiner', 'Nichols', 'Smith', 'Ford', 'Sheikh', 'Scriven', 'Alomari'],
                    'email': ['kieron.futter@ascendishealth.com', 'vinsonn.law@carestream.com', 'rayan.vanderhoof@olympus.com', 'andy@tepha.com', 'cnichols@prosetta.com', 'bsmith@innoviveinc.com', 'rford@catholiccharitiesswks.org', 'fatima@broomestreetsociety.com', 'zack.scriven@soffaelectric.com', 'baraa@playhut.com']})

df['name_email'] = df.email.str.lower().str.split('@', expand = True)[0]

df['first'] = df.firstname.str.lower() == df.name_email
df['firstlast'] = df.firstname.str.lower() + df.lastname.str.lower() == df.name_email
df['first.last'] = df.firstname.str.lower() + '.' + df.lastname.str.lower() == df.name_email
df['last'] = df.lastname.str.lower() == df.name_email
df['f.last'] = df.firstname.str.lower()[0] + '.' + df.lastname.str.lower() == df.name_email
# ... etc. ...

print(df.iloc[:,4:].sum().sort_values(ascending=False))

Result:

first.last    4
first         2
f.last        1
last          0
firstlast     0

To make it independent of the position of the newly added columns, you can also use:

df.select_dtypes(include='bool').sum().sort_values(ascending=False)

Find most common email patterns based on email addresses and names in Python

Answers (1)

Related Questions