Steve
Steve

Reputation: 598

Check entire pandas data frame for emails and phone numbers

I have a continually growing large dataset with ~52 columns. I am trying to come up with a way to periodically check and flag the data for things like emails and phone numbers, so that it can be dealt with manually downstream.

I can find emails and numbers, but I haven't had great success applying it to the entire data frame dynamically. I have attempted to place the 'finders' in functions, and then apply to the entire data frame. I'm just not sure where to go from there either.

My desired outcome would just be a list of Items that stated if they had a phone or an email in any of the columns. (Eventually I may want to list out which columns have the issues though.)

Thanks for any help that gets me going in the right direction!

import pandas as pd, phonenumbers, re

#create small dummy data set
df = pd.DataFrame({
    'ID':[1,2,5,25,26],
    'Lineage':['apple', 'square', 'please Gino Mcneill [email protected]', 'ball', '888-555-5556 Ryan Parkes [email protected]'],
    'ShortDesc':['618-552-2255','Gino Mcneill [email protected]','',' please call now','if you have trouble you should call 816-455-5599 ASAP' ],
    'LongDesc':['Eesha Hinton', 'for help with product 56789, call 618-578-0055 immediately, or email Gino Mcneill [email protected]', 'maybe six five today for ever','more random text that could be really long and annoying','over the hills and through the woods']
    })

#find phone numbers in one column
for row in df['ShortDesc']:  
    for match in phonenumbers.PhoneNumberMatcher(row, "US"):
            print (phonenumbers.format_number(match.number, phonenumbers.PhoneNumberFormat.E164))

#find emails in one column
for row in df['ShortDesc']:  
    for match in re.findall('\S+@\S+', row): 
        print(match)

#create function for finding phone numbers
def phone_cleaner(x):
    for match in phonenumbers.PhoneNumberMatcher(x,"US"):
        return phonenumbers.format_number(match.number, phonenumbers.PhoneNumberFormat.E164)

#apply finding function to whole df
df2 = df
df2['ShortDesctel'] = df2['ShortDesc'].apply(phone_cleaner)
df2


#pseudo code:
#for each row
#for each column
#does email exist
#does phone number exist
#if yes to email
#if yes to phone number
#return ID

#Desired Result:
#ID, Email, Phone
#1          x
#2   x      x
#5   x
#25            ----Edit after the fact, this row wouldn't be displayed
#26  x      x

Upvotes: 2

Views: 1180

Answers (2)

MDR
MDR

Reputation: 2670

Maybe...

import pandas as pd
import re
import numpy as np

# regex patterns for email and phone numbers (could be simpler if more basic matching is required)
email_regex_pattern = r'(?:[a-z0-9!#$%&''*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&''*+/=?^_`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'
phone_number_regex_pattern = r'(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?'

#create small dummy data set
df = pd.DataFrame({
    'ID':[1,2,5,25,26],
    'Lineage':['apple', 'square', 'please Gino Mcneill [email protected]', 'ball', '888-555-5556 Ryan Parkes [email protected]'],
    'ShortDesc':['618-552-2255','Gino Mcneill [email protected]','',' please call now','if you have trouble you should call 816-455-5599 ASAP' ],
    'LongDesc':['Eesha Hinton', 'for help with product 56789, call 618-578-0055 immediately, or email Gino Mcneill [email protected]', 'maybe six five today for ever','more random text that could be really long and annoying','over the hills and through the woods']
    })


mask = df[['Lineage', 'ShortDesc', 'LongDesc']].apply(lambda x: x.str.contains(email_regex_pattern,regex=True)).any(axis=1)
print('Rows with emails:\n')
print(df[mask])

Outputs:

Rows with emails:

    ID  Lineage ShortDesc   LongDesc
1   2   square  Gino Mcneill [email protected]  for help with product 56789, call 618-578-0055...
2   5   please Gino Mcneill [email protected]        maybe six five today for ever
4   26  888-555-5556 Ryan Parkes [email protected]  if you have trouble you should call 816-455-55...   over the hills and through the woods

Then...

mask = df[['Lineage', 'ShortDesc', 'LongDesc']].apply(lambda x: x.str.contains(phone_number_regex_pattern,regex=True)).any(axis=1)
print('\n\nRows with phone numbers:\n')
print(df[mask])

Outputs:

Rows with phone numbers:

ID  Lineage ShortDesc   LongDesc
0   1   apple   618-552-2255    Eesha Hinton
1   2   square  Gino Mcneill [email protected]  for help with product 56789, call 618-578-0055...
4   26  888-555-5556 Ryan Parkes [email protected]  if you have trouble you should call 816-455-55...   over the hills and through the woods

Better test dataframe required for further testing.

Upvotes: 0

Arkadiusz
Arkadiusz

Reputation: 1875

We could create new dataframe and find e-mails and phones with regex, only for columns with 'object' type:

(pd.DataFrame({'ID': df['ID'],
               'Email': df.select_dtypes(object)
                          .applymap(lambda x: bool(re.findall('\S+@\S+', str(x))))
                          .any(axis=1),
               'Phone': df.select_dtypes(object)
                          .applymap(lambda x: bool(re.findall('\d+-\d+-\d+', str(x))))
                          .any(axis=1)}
              )
   .replace({True: 'x', False: ''})
) 

Output:

    ID  Email   Phone
0    1              x
1    2      x       x
2    5      x   
3   25      
4   26      x       x

If you wanna use 'phonecleaner' function, you can put it there instead of regex pattern suggested in my solution.

Upvotes: 1

Related Questions