Reputation: 10051
Given a dataframe as follows:
firstname lastname email_address \
0 Doug Watson [email protected]
1 Nick Holekamp [email protected]
2 Rob Schreiner [email protected]
3 Austin Phillips [email protected]
4 Elise Geiger [email protected]
5 Paul Urick [email protected]
6 Michael Obringer [email protected]
7 Craig Heneghan [email protected]
8 Kathy Hirst [email protected]
9 Stefan Bluemmers [email protected]
companyname
0 Dignity Health
1 Ranken Jordan Pediatric Bridge Hospital
2 WellStar Health System
3 Precision Medical Products, Inc.
4 puracap.com
5 Diplomat Specialty Pharmacy
6 Lash Group
7 West-Ward Pharmaceuticals
8 Sunovion Pharmaceuticals
9 Grünenthal Group
How could I create possible email addresses using common email patterns such as [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], etc.?
df['email1'] = df.firstname.str.lower() + '.' + df.lastname.str.lower() + '@' + df.companyname.str.replace(r'\s+', '', regex=True).str.lower() + '.com'
print(df['email1'])
Out:
0 [email protected]
1 nick.holekamp@rankenjordanpediatricbridgehospi... --->problematic
2 [email protected]
3 austin.phillips@precisionmedicalproducts,inc..com --->problematic
4 [email protected] --->problematic
...
9995 [email protected]
9996 [email protected]
9997 [email protected]
9998 [email protected]
9999 [email protected]
Some of them seem quite problematic. Could anyone help to solve this issue? Thanks a lot.
EDITED: Output of print(df) after applying @Sajith Herath's solution:
Out:
firstname lastname companyname \
0 Nick Holekamp Ranken ...
email
0 nick. ...
Upvotes: 1
Views: 867
Reputation: 1052
You can write one function that builds username permutations with different separators, and another that simplifies the company name into a domain, abbreviating it to its initials when it exceeds a maximum length, as follows:
import random
import re

import pandas as pd

data = {
    "firstname": ["Nick"],
    "lastname": ["Holekamp"],
    "companyname": ["Ranken Jordan Pediatric Bridge Hospital"],
}
df = pd.DataFrame(data=data)
max_char = 5
emails = []

def simplify_domain(text):
    # Names longer than max_char are abbreviated to their uppercase initials
    if len(text) > max_char:
        text = ''.join([c for c in text if c.isupper()])
        return text.lower()
    # Short names: strip whitespace (built-in str.replace does not take a regex)
    return re.sub(r"\s+", "", text).lower()

def username_permutations(first_name, last_name):
    # define separators
    separators = [".", "_", "-"]
    # lower case, one username per separator (no stray spaces)
    combinations = [f"{first_name.lower()}{sep}{last_name.lower()}" for sep in separators]
    # append a random number to the tail
    n = random.randint(1, 100)
    combinations.extend([f"{x}{n}" for x in combinations])
    return combinations

for index, row in df.iterrows():
    usernames = username_permutations(row["firstname"], row["lastname"])
    domain = simplify_domain(row["companyname"])
    email_permutations = [f"{x}@{domain}.com" for x in usernames]
    emails.append(','.join(email_permutations))
df["email"] = emails
Final result will be [email protected],[email protected],[email protected],[email protected],[email protected],[email protected]
You can modify the simplify_domain method to further validate the given string, for example by removing values such as inc or .com.
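One possible way to do that clean-up is sketched below. It is only an illustration, not a tested rule set: the LEGAL_SUFFIXES set, the clean_company helper and the max_char default of 20 are assumptions made for this example. It also transliterates accented letters by simply dropping the accent (so "ü" becomes "u", not "ue"), which may not match the real domain.

```python
import re
import unicodedata

# Hypothetical list of legal suffixes to drop; extend it for your own data.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp"}

def clean_company(text):
    """Split a company name into lowercase ASCII words without legal suffixes."""
    # Drop accents: "Grünenthal" -> "Grunenthal"
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower().strip()
    # Drop a trailing top-level domain such as ".com"
    text = re.sub(r"\.(com|net|org)$", "", text)
    # Split on anything that is not a letter or digit (commas, dots, hyphens, spaces)
    words = [w for w in re.split(r"[^a-z0-9]+", text) if w]
    return [w for w in words if w not in LEGAL_SUFFIXES]

def simplify_domain(text, max_char=20):
    words = clean_company(text)
    joined = "".join(words)
    if len(joined) > max_char:
        # Abbreviate long names to their initials, as in the answer above
        return "".join(w[0] for w in words)
    return joined

print(simplify_domain("Precision Medical Products, Inc."))  # pmp
print(simplify_domain("puracap.com"))                       # puracap
print(simplify_domain("Grünenthal Group"))                  # grunenthalgroup
```

Because the cleaning happens before the length check, names like "puracap.com" no longer produce "puracap.com.com", and punctuation never ends up in the domain.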
Upvotes: 1