Check entire pandas data frame for emails and phone numbers

Question

I have a continually growing large dataset with ~52 columns. I am trying to come up with a way to periodically check and flag the data for things like emails and phone numbers, so that it can be dealt with manually downstream.

I can find emails and numbers, but I haven't had great success applying it to the entire data frame dynamically. I have attempted to place the 'finders' in functions, and then apply to the entire data frame. I'm just not sure where to go from there either.

My desired outcome would just be a list of Items that stated if they had a phone or an email in any of the columns. (Eventually I may want to list out which columns have the issues though.)

Thanks for any help that gets me going in the right direction!

import pandas as pd, phonenumbers, re

#create small dummy data set
df = pd.DataFrame({
    'ID':[1,2,5,25,26],
    'Lineage':['apple', 'square', 'please Gino Mcneill gm@yahoo.com', 'ball', '888-555-5556 Ryan Parkes rp@abc.io'],
    'ShortDesc':['618-552-2255','Gino Mcneill gm@github.com','',' please call now','if you have trouble you should call 816-455-5599 ASAP' ],
    'LongDesc':['Eesha Hinton', 'for help with product 56789, call 618-578-0055 immediately, or email Gino Mcneill gm@yahoo.com', 'maybe six five today for ever','more random text that could be really long and annoying','over the hills and through the woods']
    })

#find phone numbers in one column
for row in df['ShortDesc']:  
    for match in phonenumbers.PhoneNumberMatcher(row, "US"):
            print (phonenumbers.format_number(match.number, phonenumbers.PhoneNumberFormat.E164))

#find emails in one column
for row in df['ShortDesc']:  
    for match in re.findall('\S+@\S+', row): 
        print(match)

#create function for finding phone numbers
def phone_cleaner(x):
    for match in phonenumbers.PhoneNumberMatcher(x,"US"):
        return phonenumbers.format_number(match.number, phonenumbers.PhoneNumberFormat.E164)

#apply finding function to whole df
df2 = df
df2['ShortDesctel'] = df2['ShortDesc'].apply(phone_cleaner)
df2


#pseudo code:
#for each row
#for each column
#does email exist
#does phone number exist
#if yes to email
#if yes to phone number
#return ID

#Desired Result:
#ID, Email, Phone
#1          x
#2   x      x
#5   x
#25            ----Edit after the fact, this row wouldn't be displayed
#26  x      x

Arkadiusz · Accepted Answer

We could create new dataframe and find e-mails and phones with regex, only for columns with 'object' type:

(pd.DataFrame({'ID': df['ID'],
               'Email': df.select_dtypes(object)
                          .applymap(lambda x: bool(re.findall('\S+@\S+', str(x))))
                          .any(axis=1),
               'Phone': df.select_dtypes(object)
                          .applymap(lambda x: bool(re.findall('\d+-\d+-\d+', str(x))))
                          .any(axis=1)}
              )
   .replace({True: 'x', False: ''})
)

Output:

    ID  Email   Phone
0    1              x
1    2      x       x
2    5      x   
3   25      
4   26      x       x

If you wanna use 'phonecleaner' function, you can put it there instead of regex pattern suggested in my solution.

Check entire pandas data frame for emails and phone numbers

Answers (2)

Related Questions