Reputation: 1119
I'm running a process to clean up some UK telephone numbers and have decided to run a lambda function across a Pandas DataFrame, using regex substitution to remove the characters I don't want (anything non-numeric, but allowing a +).
Code is as follows: (phone_test is just a DataFrame of test examples, two columns, an index and the values)
def clean_phone_number(tel_no):
    for row in test_data:
        row = re.sub('[^?0-9+]+', '', row)
    return(row)
phone_test_result = phone_test['TEL_NUMBER'].apply(lambda x: clean_phone_number(x))
The problem I've got is that the outcome (phone_test_result) just returns the index of the phone_test dataframe and not the newly formatted telephone number. I've been racking my brain for a couple of hours, but I'm sure it's a simple problem.
At first I thought it was just the positioning of the return line (it should be under the for, right?), but when I do that I just get a single phone number, repeated for the length of the loop (and it isn't even in the phone_test dataframe!)
PLS HALP SO. thank you.
After the responses, this is what I've ended up with:
Clean the phone number using regex and only take the first 13 characters,
- substituting a leading zero with +44
- deleting everything with a length of less than 13 characters.
It's not perfect:
- there are some phone numbers that legitimately have fewer digits
- it means I trim out all of the extension numbers
def clean_phone_number(tel_no):
    clean_tel = re.sub('[^?0-9+]+', '', tel_no)[:13]
    if clean_tel[:1] == '0':
        clean_tel = '+44'+clean_tel[1:]
    if len(clean_tel) < 13:
        clean_tel = ''
    return(clean_tel)
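One thing to watch in the function above: the slice to 13 characters happens before the leading zero is swapped for +44, so a number that starts with 0 and carries an extension can come out 15 characters long. A minimal sketch of a variant that swaps the prefix first and truncates afterwards is below. The name clean_phone_number_v2 is made up for illustration, and it assumes that the ? in the character class was not meant to be kept and that every kept number should be exactly 13 characters.

```python
import re


def clean_phone_number_v2(tel_no):
    # Hypothetical variant: keep only digits and '+'
    # (assumption: the literal '?' in the original class was unintended)
    clean_tel = re.sub(r'[^0-9+]+', '', tel_no)
    # Swap a leading zero for the UK country code before truncating
    if clean_tel.startswith('0'):
        clean_tel = '+44' + clean_tel[1:]
    # Truncate after the prefix swap so extension digits cannot leak through
    clean_tel = clean_tel[:13]
    # Discard anything that is still shorter than 13 characters
    return clean_tel if len(clean_tel) == 13 else ''


with_extension = clean_phone_number_v2('07721 051 851 ext 1234')
too_short = clean_phone_number_v2('020 8413 96')
```

Here with_extension comes out as '+447721051851' and too_short as an empty string, so the two limitations listed above (short numbers dropped, extensions trimmed) still apply.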
Upvotes: 1
Views: 2202
Reputation: 164693
pd.Series.apply applies a function to each value in a series. Notice that the lambda is unnecessary.
import re
import pandas as pd

phone_test = pd.DataFrame({'TEL_NUMBER': ['+44-020841396', '+44-07721-051-851']})

def clean_phone_number(tel_no):
    # tel_no is a single value from the TEL_NUMBER column
    return re.sub('[^?0-9+]+', '', tel_no)

phone_test_result = phone_test['TEL_NUMBER'].apply(clean_phone_number)
# 0 +44020841396
# 1 +4407721051851
# Name: TEL_NUMBER, dtype: object
pd.DataFrame.apply, in contrast, applies a function to each row in a dataframe when called with axis=1:
def clean_phone_number(row):
    return re.sub('[^?0-9+]+', '', row['TEL_NUMBER'])
phone_test_result = phone_test.apply(clean_phone_number, axis=1)
# 0 +44020841396
# 1 +4407721051851
# dtype: object
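Since the cleaning here is a single regex substitution, another option is to skip apply altogether and use the .str accessor. This is only a sketch of that alternative, not something from the original answer. It reuses the same sample data, and it drops the ? from the character class on the assumption that question marks should be removed too.

```python
import pandas as pd

phone_test = pd.DataFrame({'TEL_NUMBER': ['+44-020841396', '+44-07721-051-851']})

# Vectorised substitution: remove every run of characters that is not a digit or '+'
phone_test_result = phone_test['TEL_NUMBER'].str.replace(r'[^0-9+]+', '', regex=True)
```

The result is a Series with the same index and name as the TEL_NUMBER column.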
Upvotes: 3
Reputation: 1598
You don't have to loop; apply executes the function once for each element:
def clean_phone_number(tel_no):
    return re.sub('[^?0-9+]+', '', tel_no)
or directly
phone_test_result = phone_test['TEL_NUMBER'].apply(lambda x: re.sub('[^?0-9+]+', '', x))
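A side note on the pattern itself: inside a character class the ? is a literal, so '[^?0-9+]+' leaves question marks in the output as well as digits and +. The short comparison below uses a made-up input string to show the difference. Whether the ? should be kept depends on the data, so treat the second pattern as an assumption about the intent.

```python
import re

sample = '+44 (0)20?8413'  # made-up input for illustration

# Pattern from the thread: '?' is in the negated class, so it survives
keeps_qmark = re.sub('[^?0-9+]+', '', sample)

# Without the '?', only digits and '+' survive
digits_plus_only = re.sub('[^0-9+]+', '', sample)
```

With this input, keeps_qmark is '+44020?8413' while digits_plus_only is '+440208413'.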
Upvotes: 2