Reputation: 10051
Given a dataframe as follows:
firstname lastname email \
0 Kieron Futter [email protected]
1 Vinsonn Law [email protected]
2 Rayan Vanderhoof [email protected]
3 Andy Joiner [email protected]
4 Christine Nichols [email protected]
5 Bo Smith [email protected]
6 Rebecca Ford [email protected]
7 Fatima Sheikh [email protected]
8 Zack Scriven [email protected]
9 Bara Alomari [email protected]
companyname
0 ascendishealth.com
1 Carestream
2 Olympus America Inc.
3 Tepha Inc.
4 Prosetta Corp.
5 Innovive, Inc.
6 catholiccharitiesswks.org
7 broomestreetsociety
8 soffaelectric
9 playhut.com
How could I find the top 3 most common email patterns ([email protected], [email protected], [email protected], [email protected],[email protected], [email protected], [email protected], [email protected], [email protected]) by comparing the value in the email
column to firstname
and lastname
columns?
I have used df['name_email'] = df.email.str.split('@', expand = True)[0]
to extract names from email address.
Out:
0 douglas.watson
1 nick.holekamp
2 rob.schriener
3 austin.phillips
4 egeiger
...
995 thanley
996 cmarks
997 darryl.rickner
998 lalit
999 parul.dutt
Thanks.
EDIT:
The error raised from @Stef's code:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/usr/local/lib/python3.7/site-packages/pandas/core/series.py in _try_kind_sort(arr)
2947 # if kind==mergesort, it can fail for object dtype
-> 2948 return arr.argsort(kind=kind)
2949 except TypeError:
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-25-a939f85d610f> in <module>
7 df['f.last'] = df.firstname.str.lower()[0] + '.' + df.lastname.str.lower() == df.name_email
8
----> 9 print(df.iloc[:,4:].sum().sort_values(ascending=False))
/usr/local/lib/python3.7/site-packages/pandas/core/series.py in sort_values(self, axis, ascending, inplace, kind, na_position, ignore_index)
2960 idx = ibase.default_index(len(self))
2961
-> 2962 argsorted = _try_kind_sort(arr[good])
2963
2964 if is_list_like(ascending):
/usr/local/lib/python3.7/site-packages/pandas/core/series.py in _try_kind_sort(arr)
2950 # stable sort not available for object dtype
2951 # uses the argsort default quicksort
-> 2952 return arr.argsort(kind="quicksort")
2953
2954 arr = self._values
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
The result of df.iloc[:,4:].info()
Out:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9937 entries, 0 to 9999
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 companyname 9937 non-null object
1 industry 9937 non-null object
2 level 9624 non-null object
3 primarydomain 9937 non-null object
4 twitterid 225 non-null object
5 facebookid 11 non-null object
6 linkedinid 2564 non-null object
7 industry.1 9937 non-null object
8 companysize 7538 non-null object
9 companyrevenue 7596 non-null object
10 city 8773 non-null object
11 state 7936 non-null object
12 dept 8865 non-null object
13 phonenumber 41 non-null object
14 net_name 9937 non-null object
15 domain_name 9937 non-null object
16 email 9937 non-null object
17 name_email1 9937 non-null object
18 name_email 9937 non-null object
19 first 9937 non-null bool
20 firstlast 9937 non-null bool
21 first.last 9937 non-null bool
22 last 9937 non-null bool
23 f.last 9937 non-null bool
dtypes: bool(5), object(19)
memory usage: 1.9+ MB
Upvotes: 1
Views: 591
Reputation: 30609
You could add columns for all possible combinations and then count the hits:
import pandas as pd
df = pd.DataFrame({ 'firstname': ['Kieron', 'Vinsonn', 'Rayan', 'Andy', 'Christine', 'Bo', 'Rebecca', 'Fatima', 'Zack', 'Bara'],
'lastname': ['Futter', 'Law', 'Vanderhoof', 'Joiner', 'Nichols', 'Smith', 'Ford', 'Sheikh', 'Scriven', 'Alomari'],
'email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']})
df['name_email'] = df.email.str.lower().str.split('@', expand = True)[0]
df['first'] = df.firstname.str.lower() == df.name_email
df['firstlast'] = df.firstname.str.lower() + df.lastname.str.lower() == df.name_email
df['first.last'] = df.firstname.str.lower() + '.' + df.lastname.str.lower() == df.name_email
df['last'] = df.lastname.str.lower() == df.name_email
df['f.last'] = df.firstname.str.lower()[0] + '.' + df.lastname.str.lower() == df.name_email
# ... etc. ...
print(df.iloc[:,4:].sum().sort_values(ascending=False))
Result:
first.last 4
first 2
f.last 1
last 0
firstlast 0
To make it independent of the position of the newly added columns, you can also use:
df.select_dtypes(include='bool').sum().sort_values(ascending=False)
Upvotes: 4