Manuel
Manuel

Reputation: 143

how do I best validate email in pandas data frame

I have a data frame (df) with emails and numbers like

    email                          euro
0   [email protected]      150
1   [email protected]     50
2   [email protected]      300
3   kjfslkfj                         0
4   [email protected]    200

I need to filter all rows with correct emails and euro equal to or greater than 100 and another list with correct emails and euro lower than 100. I know that I can filter by euro like this

df_gt_100 = df.euro >= 100

and

df_lt_100 = df.euro < 100

But I can't find a way to filter the email addresses. I imported the email_validate package and tried things like this

validate_email(df.email)

which gives me a TypeError: expected string or bytes-like object.

Can anyone pls give me a hint how to approach this issue. It'd be nice if I could do this all in one filter with the AND and OR operators.

Thanks in advance, Manuel

Upvotes: 5

Views: 10069

Answers (4)

jezrael
jezrael

Reputation: 863166

Use apply, chain mask by & for AND and filter by boolean indexing:

from validate_email import validate_email

df1 = df[(df['euro'] > 100) & df['email'].apply(validate_email)]
print (df1)
                         email  euro
0    [email protected]   150
2    [email protected]   300
4  [email protected]   200

Another approach with regex and contains:

df1 = df[(df['euro'] > 100) &df['email'].str.contains(r'[^@]+@[^@]+\.[^@]+')]
print (df1)
                         email  euro
0    [email protected]   150
2    [email protected]   300
4  [email protected]   200

Upvotes: 10

Amit Basuri
Amit Basuri

Reputation: 613

In [30]: from validate_email import validate_email

In [31]: df
Out[31]: 
                       email
0  [email protected]
1                   kjfslkfj

In [32]: df['is_valid_email'] = df['email'].apply(lambda x:validate_email(x))

In [33]: df
Out[33]: 
                       email  is_valid_email
0  [email protected]            True
1                   kjfslkfj           False

In [34]: df['email'][df['is_valid_email']]
Out[34]: 
0    [email protected]

Upvotes: 5

Scratch&#39;N&#39;Purr
Scratch&#39;N&#39;Purr

Reputation: 10429

You can use regex expressions to find a match and then use apply on the email column to create a T/F column for where an email exists:

import re
import pandas as pd

pattern = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")  # this is the regex expression to search on

df = pd.DataFrame({'email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', 'notanemail'], 'euro': [123, 321, 150, 0, 133]})
df['isemail'] = df['email'].apply(lambda x: True if pattern.match(x) else False)

Result:

                     email  euro  isemail
0     [email protected]   123     True
1        [email protected]   321     True
2  [email protected]   150     True
3        [email protected]     0     True
4               notanemail   133    False

Now you can filter on the isemail column.

Upvotes: 3

Aritesh
Aritesh

Reputation: 2103

validate_email returns a lot of information e.g. smtp etc., and for invalid emails, it throws an EmailNotValidError exception. You can write a function, and apply on pandas series -

from email_validator import validate_email, EmailNotValidError

def validate_e(x):
    try:
        v = validate_email(x)
        return True
    except EmailNotValidError as e:
        return False

df["Email_validate"] = df['email'].apply(validate_e)

Upvotes: 1

Related Questions