Reputation: 23
I have code like this. I want to remove the bad characters from a mailing list, and I tried to do that with the bad_chars() function.
import pandas as pd
import numpy as np
excelRead = pd.read_excel('mailing.xlsx')
excelRead.dropna(inplace= True)
badCharsList = ["ü", "ı", "ö", "ç", "ş", "ğ", "!", "#", "$", "%", "&", "'", "*", "+", "/", "=", "?", "^", "`", "{", "|", "}", "~", "(", ")", ",", ":", ";", "<", ">", "[", " "]
def bad_chars(x):
    for i in badCharsList:
        if i.lower() not in x.lower():
            return i
        else:
            return np.nan
excelTest = excelRead[excelRead['mails'].str.endswith("@gmail.com", na=False) | excelRead['mails'].str.endswith("@hotmail.com", na=False) | excelRead['mails'].str.endswith("@outlook.com", na=False) | excelRead['mails'].str.endswith("@icloud.com", na=False) | excelRead['mails'].str.endswith("@windowslive.com", na=False) | excelRead['mails'].str.endswith("@yandex.com", na=False) | excelRead['mails'].str.endswith("@mynet.com", na=False) | excelRead['mails'].str.endswith("@hotmail.com.tr", na=False) | excelRead['mails'].str.endswith("@yahoo.com", na=False)]
lower = excelTest['mails'].str.lower()
testBad = excelTest['mails'].apply(bad_chars)
print(testBad)
But this is the output I got. Where do you think I went wrong or how can I achieve this?
Output
0 ü
1 ü
2 ü
3 ü
4 ü
..
107808 ü
107809 ü
107810 ü
107811 ü
107812 ü
Name: mails, Length: 104507, dtype: object
Sample output before applying the bad_chars() function:
0 [email protected]
1 [email protected]
2 [email protected]
3 [email protected]
4 [email protected]
...
107808 [email protected]
107809 [email protected]
107810 [email protected]
107811 [email protected]
107812 [email protected]
Upvotes: -1
Views: 46
Reputation: 11657
You can use str.contains with a regular expression that matches any of the bad characters, joining them with | (alternation). We must take care to escape the characters that have special meaning in regular expressions. (I based this on this answer.)
import re
email_domains = ["gmail.com","hotmail.com","outlook.com","icloud.com",
"windowslive.com","yandex.com", "mynet.com","hotmail.com.tr",
"yahoo.com"]
# check if domain is in domain list
excelTest = excelRead[excelRead['mails'].str.split("@").str[-1].isin(email_domains)]
# look for bad chars with a regex
regex = '|'.join(re.escape(s.lower()) for s in badCharsList)
excelTest = excelTest[~excelTest['mails'].str.lower().str.contains(regex)]
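As for why the original function printed ü for every row: both branches of the if/else return, so the loop never gets past the first character in the list, and because the test is "not in", it returns "ü" for every address that does not contain ü. Below is a minimal sketch of a corrected loop next to the regex filter. The sample addresses and the names sample, bad_chars_list and first_bad_char are made up for illustration and are not from the question's data.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample standing in for the 'mails' column of mailing.xlsx.
sample = pd.DataFrame({"mails": [
    "alice@gmail.com",
    "bad!name@gmail.com",
    "şule@hotmail.com",
    "bob smith@yahoo.com",
    "carol@example.org",
]})

# Same characters as in the question, with a comma between "?" and "^".
bad_chars_list = ["ü", "ı", "ö", "ç", "ş", "ğ", "!", "#", "$", "%", "&", "'",
                  "*", "+", "/", "=", "?", "^", "`", "{", "|", "}", "~", "(",
                  ")", ",", ":", ";", "<", ">", "[", " "]


def first_bad_char(value):
    """Return the first listed bad character found in value, or NaN if none."""
    lowered = value.lower()
    for ch in bad_chars_list:
        if ch in lowered:      # 'in', not 'not in'
            return ch
    return np.nan              # reached only after every character was checked


# Row-by-row check: which bad character (if any) each address contains.
found = sample["mails"].apply(first_bad_char)

# Vectorised check: one escaped alternation over all bad characters.
pattern = "|".join(re.escape(ch) for ch in bad_chars_list)
has_bad = sample["mails"].str.lower().str.contains(pattern, regex=True, na=False)

# Keep only the rows without any bad character.
clean = sample[~has_bad]
print(found)
print(clean)
```

The apply version is handy for seeing which character triggered the match; the regex version is probably the better choice for filtering 100,000+ rows, since it avoids calling a Python function once per row.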
Upvotes: 1