Applying lambda function to pandas dataframe - returns index but not values?

Question

I'm running a process to clean up some UK telephone numbers and have decided to apply a lambda function across a pandas DataFrame, using a regex substitution to remove the characters I don't want (anything non-numeric, except that a + is allowed).

The code is as follows (phone_test is just a DataFrame of test examples with two columns: an index and the values):

def clean_phone_number(tel_no):
    for row in test_data:
        row = re.sub('[^?0-9+]+', '', row)
        return(row)

phone_test_result = phone_test['TEL_NUMBER'].apply(lambda x: clean_phone_number(x))

The problem I've got is that the outcome (phone_test_result) just returns the index of the phone_test dataframe and not the newly formatted telephone number. I've been racking my brain for a couple of hours, but I'm sure it's a simple problem.

At first I thought it was just the positioning of the return line (it should be under the for, right?), but when I do that I just get an output of a single phone number, repeated for the length of the loop (one that isn't even in the phone_test dataframe!).

PLS HALP SO. thank you.

After the responses, this is what I've ended up with:

Clean the phone number using regex and take only the first 13 characters,
- substituting a leading zero with +44
- deleting everything with a length of less than 13 characters.
It's not perfect:
- some phone numbers legitimately have fewer digits
- it means I trim out all of the extension numbers

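import re

def clean_phone_number(tel_no):
    # Strip unwanted characters, then keep the first 13 characters.
    clean_tel = re.sub('[^?0-9+]+', '', tel_no)[:13]
    if clean_tel[:1] == '0':
        clean_tel = '+44' + clean_tel[1:]
    # Length check dedented so it covers every number, as described above.
    if len(clean_tel) < 13:
        clean_tel = ''
    return clean_tel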
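A quick check of those steps on a few made-up numbers might look like the sketch below. It assumes the length check is meant to apply to every number, as the list above describes, and the sample inputs are invented for illustration rather than taken from phone_test.

```python
import re

def clean_phone_number(tel_no):
    # Keep digits and '+' (the '?' in the class also keeps literal
    # question marks), then truncate to the first 13 characters.
    clean_tel = re.sub('[^?0-9+]+', '', tel_no)[:13]
    # Swap a leading zero for the +44 country code.
    if clean_tel[:1] == '0':
        clean_tel = '+44' + clean_tel[1:]
    # Assumption: anything shorter than 13 characters is blanked out.
    if len(clean_tel) < 13:
        clean_tel = ''
    return clean_tel

# Invented sample inputs, not real data.
samples = ['020 7946 0018', '+44 7700 900123 ext 12', '01234']
results = [clean_phone_number(s) for s in samples]
print(results)
# ['+442079460018', '+447700900123', '']
```

The second sample shows the extension being trimmed, and the third shows a short number being dropped, which are the two limitations noted above.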

jpp · Accepted Answer

pd.Series.apply applies a function to each value in a series. Note that the lambda is unnecessary: the function can be passed directly.

import re
import pandas as pd

phone_test = pd.DataFrame({'TEL_NUMBER': ['+44-020841396', '+44-07721-051-851']})

def clean_phone_number(tel_no):
    return re.sub('[^?0-9+]+', '', tel_no)

phone_test_result = phone_test['TEL_NUMBER'].apply(clean_phone_number)

# 0      +44020841396
# 1    +4407721051851
# Name: TEL_NUMBER, dtype: object
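Because pd.Series.apply calls the function once per value, a function that ignores its argument returns the same thing for every row. The sketch below reproduces that behaviour under one assumption: that test_data in the question was a DataFrame (the question does not show it). The name broken_clean is hypothetical and only stands in for the question's function.

```python
import re
import pandas as pd

# Assumed stand-in for the question's global `test_data`.
test_data = pd.DataFrame({'TEL_NUMBER': ['+44-020841396', '+44-07721-051-851']})

def broken_clean(tel_no):
    # `tel_no` is never used; the loop walks the global instead.
    for row in test_data:      # iterating a DataFrame yields column labels
        row = re.sub('[^?0-9+]+', '', row)
        return row             # exits on the first iteration

labels = list(test_data)
result = test_data['TEL_NUMBER'].apply(broken_clean)

print(labels)           # ['TEL_NUMBER']
print(result.tolist())  # ['', '']
```

Under that assumption the column label 'TEL_NUMBER' contains no digits or plus signs, so every row comes back as an empty string and only the index appears to be printed.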

pd.DataFrame.apply, in contrast, applies a function along an axis of a dataframe. With axis=1 the function receives each row in turn:

def clean_phone_number(row):
    return re.sub('[^?0-9+]+', '', row['TEL_NUMBER'])

phone_test_result = phone_test.apply(clean_phone_number, axis=1)

# 0      +44020841396
# 1    +4407721051851
# dtype: object
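As an alternative to apply, the same substitution can be written with the .str accessor, which avoids defining a Python function at all. This is a minimal sketch rather than part of the original answer; it reuses the same regex and the same two sample numbers.

```python
import pandas as pd

phone_test = pd.DataFrame({'TEL_NUMBER': ['+44-020841396', '+44-07721-051-851']})

# Same character class as above, applied to the whole column at once.
# regex=True is passed explicitly so the pattern is treated as a regex.
cleaned = phone_test['TEL_NUMBER'].str.replace(r'[^?0-9+]+', '', regex=True)

print(cleaned.tolist())
# ['+44020841396', '+4407721051851']
```

One difference worth knowing: .str.replace leaves missing values as NaN, whereas passing NaN to re.sub inside apply would raise a TypeError.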
