Reputation: 143
I have a data frame (df) with emails and numbers like
email euro
0 [email protected] 150
1 [email protected] 50
2 [email protected] 300
3 kjfslkfj 0
4 [email protected] 200
I need to filter all rows with correct emails and euro equal to or greater than 100 and another list with correct emails and euro lower than 100. I know that I can filter by euro like this
df_gt_100 = df.euro >= 100
and
df_lt_100 = df.euro < 100
But I can't find a way to filter the email addresses. I imported the email_validate package and tried things like this
validate_email(df.email)
which gives me a TypeError: expected string or bytes-like object.
Can anyone pls give me a hint how to approach this issue. It'd be nice if I could do this all in one filter with the AND and OR operators.
Thanks in advance, Manuel
Upvotes: 5
Views: 10069
Reputation: 863166
Use apply
, chain mask by &
for AND
and filter by boolean indexing
:
from validate_email import validate_email
df1 = df[(df['euro'] > 100) & df['email'].apply(validate_email)]
print (df1)
email euro
0 [email protected] 150
2 [email protected] 300
4 [email protected] 200
Another approach with regex
and contains
:
df1 = df[(df['euro'] > 100) &df['email'].str.contains(r'[^@]+@[^@]+\.[^@]+')]
print (df1)
email euro
0 [email protected] 150
2 [email protected] 300
4 [email protected] 200
Upvotes: 10
Reputation: 613
In [30]: from validate_email import validate_email
In [31]: df
Out[31]:
email
0 [email protected]
1 kjfslkfj
In [32]: df['is_valid_email'] = df['email'].apply(lambda x:validate_email(x))
In [33]: df
Out[33]:
email is_valid_email
0 [email protected] True
1 kjfslkfj False
In [34]: df['email'][df['is_valid_email']]
Out[34]:
0 [email protected]
Upvotes: 5
Reputation: 10429
You can use regex expressions to find a match and then use apply on the email
column to create a T/F column for where an email exists:
import re
import pandas as pd
pattern = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)") # this is the regex expression to search on
df = pd.DataFrame({'email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', 'notanemail'], 'euro': [123, 321, 150, 0, 133]})
df['isemail'] = df['email'].apply(lambda x: True if pattern.match(x) else False)
Result:
email euro isemail
0 [email protected] 123 True
1 [email protected] 321 True
2 [email protected] 150 True
3 [email protected] 0 True
4 notanemail 133 False
Now you can filter on the isemail
column.
Upvotes: 3
Reputation: 2103
validate_email
returns a lot of information e.g. smtp etc., and for invalid emails, it throws an EmailNotValidError
exception. You can write a function, and apply on pandas series -
from email_validator import validate_email, EmailNotValidError
def validate_e(x):
try:
v = validate_email(x)
return True
except EmailNotValidError as e:
return False
df["Email_validate"] = df['email'].apply(validate_e)
Upvotes: 1