ah bon
ah bon

Reputation: 10051

Find most common email patterns based on email addresses and names in Python

Given a dataframe as follows:

   firstname    lastname                             email  \
0     Kieron      Futter  [email protected]   
1    Vinsonn         Law        [email protected]   
2      Rayan  Vanderhoof      [email protected]   
3       Andy      Joiner                    [email protected]   
4  Christine     Nichols             [email protected]   
5         Bo       Smith            [email protected]   
6    Rebecca        Ford   [email protected]   
7     Fatima      Sheikh    [email protected]   
8       Zack     Scriven    [email protected]   
9       Bara     Alomari                 [email protected]   

                 companyname  
0         ascendishealth.com  
1                 Carestream  
2       Olympus America Inc.  
3                 Tepha Inc.  
4             Prosetta Corp.  
5             Innovive, Inc.  
6  catholiccharitiesswks.org  
7        broomestreetsociety  
8              soffaelectric  
9                playhut.com  

How could I find the top 3 most common email patterns ([email protected], [email protected], [email protected], [email protected],[email protected], [email protected], [email protected], [email protected], [email protected]) by comparing the value in the email column to firstname and lastname columns?

I have used df['name_email'] = df.email.str.split('@', expand = True)[0] to extract names from email address.

Out:

0        douglas.watson
1         nick.holekamp
2         rob.schriener
3       austin.phillips
4               egeiger
             ...       
995             thanley
996              cmarks
997      darryl.rickner
998               lalit
999          parul.dutt

Thanks.

EDIT:

The error raised from @Stef's code:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/site-packages/pandas/core/series.py in _try_kind_sort(arr)
   2947                 # if kind==mergesort, it can fail for object dtype
-> 2948                 return arr.argsort(kind=kind)
   2949             except TypeError:

TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-25-a939f85d610f> in <module>
      7 df['f.last'] = df.firstname.str.lower()[0] + '.' + df.lastname.str.lower() == df.name_email
      8 
----> 9 print(df.iloc[:,4:].sum().sort_values(ascending=False))

/usr/local/lib/python3.7/site-packages/pandas/core/series.py in sort_values(self, axis, ascending, inplace, kind, na_position, ignore_index)
   2960         idx = ibase.default_index(len(self))
   2961 
-> 2962         argsorted = _try_kind_sort(arr[good])
   2963 
   2964         if is_list_like(ascending):

/usr/local/lib/python3.7/site-packages/pandas/core/series.py in _try_kind_sort(arr)
   2950                 # stable sort not available for object dtype
   2951                 # uses the argsort default quicksort
-> 2952                 return arr.argsort(kind="quicksort")
   2953 
   2954         arr = self._values

TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'

The result of df.iloc[:,4:].info()

Out:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9937 entries, 0 to 9999
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   companyname     9937 non-null   object
 1   industry        9937 non-null   object
 2   level           9624 non-null   object
 3   primarydomain   9937 non-null   object
 4   twitterid       225 non-null    object
 5   facebookid      11 non-null     object
 6   linkedinid      2564 non-null   object
 7   industry.1      9937 non-null   object
 8   companysize     7538 non-null   object
 9   companyrevenue  7596 non-null   object
 10  city            8773 non-null   object
 11  state           7936 non-null   object
 12  dept            8865 non-null   object
 13  phonenumber     41 non-null     object
 14  net_name        9937 non-null   object
 15  domain_name     9937 non-null   object
 16  email           9937 non-null   object
 17  name_email1     9937 non-null   object
 18  name_email      9937 non-null   object
 19  first           9937 non-null   bool  
 20  firstlast       9937 non-null   bool  
 21  first.last      9937 non-null   bool  
 22  last            9937 non-null   bool  
 23  f.last          9937 non-null   bool  
dtypes: bool(5), object(19)
memory usage: 1.9+ MB

Upvotes: 1

Views: 591

Answers (1)

Stef
Stef

Reputation: 30609

You could add columns for all possible combinations and then count the hits:

import pandas as pd

df = pd.DataFrame({ 'firstname': ['Kieron', 'Vinsonn', 'Rayan', 'Andy', 'Christine', 'Bo', 'Rebecca', 'Fatima', 'Zack', 'Bara'],
                    'lastname': ['Futter', 'Law', 'Vanderhoof', 'Joiner', 'Nichols', 'Smith', 'Ford', 'Sheikh', 'Scriven', 'Alomari'],
                    'email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']})

df['name_email'] = df.email.str.lower().str.split('@', expand = True)[0]

df['first'] = df.firstname.str.lower() == df.name_email
df['firstlast'] = df.firstname.str.lower() + df.lastname.str.lower() == df.name_email
df['first.last'] = df.firstname.str.lower() + '.' + df.lastname.str.lower() == df.name_email
df['last'] = df.lastname.str.lower() == df.name_email
df['f.last'] = df.firstname.str.lower()[0] + '.' + df.lastname.str.lower() == df.name_email
# ... etc. ...

print(df.iloc[:,4:].sum().sort_values(ascending=False))

Result:

first.last    4
first         2
f.last        1
last          0
firstlast     0

To make it independent of the position of the newly added columns, you can also use:

df.select_dtypes(include='bool').sum().sort_values(ascending=False)

Upvotes: 4

Related Questions