Reputation: 3
I have successfully extracted the 2 sheets of data and appended them, but I want to clean the phone number field. The replace line does not raise an error, but it also does not change anything.
Is there another method I should use to clean the phone number string?
import pandas as pd
import xlwings as xw

filename = 'file.xlsx'
wb = xw.Book(filename)
sheet1 = wb.sheets['sheet1']
df1 = sheet1.used_range.options(pd.DataFrame, index=False, header=True).value
sheet2 = wb.sheets['sheet2']
df2 = sheet2.used_range.options(pd.DataFrame, index=False, header=True).value
wb.close()
lists_combined = pd.concat([df1, df2])
lists_combined['filename'] = filename
lists_combined['CustomerVoicePhone'] = lists_combined['CustomerVoicePhone'].replace('-','').replace('(','').replace(')','').replace('+','').replace(' ','')
lists_combined = lists_combined.filter(items=['filename','CustomerEmail', 'CustomerVoicePhone','CustomerTextPhone'])
Upvotes: 0
Views: 116
Reputation: 1252
You can map a lambda function over every row that goes through each character and keeps only the digits:
lists_combined['CustomerVoicePhone'] = (lists_combined.CustomerVoicePhone
                                        .map(lambda x: ''.join(filter(str.isdigit, x))))
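One caveat with the lambda above: it raises a TypeError if a cell is not a string, for example a blank Excel cell that arrives as None/NaN, or a phone number that Excel stored as a number and that is read as a float. Below is a minimal sketch of a more defensive version. The helper name keep_digits and the sample values are made up for illustration, and returning None for missing cells is an assumption about what you want.

```python
import math

def keep_digits(value):
    """Return only the digit characters of value, or None if it is missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if isinstance(value, float) and value.is_integer():
        # Numeric Excel cells typically arrive as floats, e.g. 5551234567.0;
        # convert to int first so the trailing ".0" does not add a digit
        value = int(value)
    return ''.join(ch for ch in str(value) if ch.isdigit())

# Made-up sample values
samples = ['(555) 123-4567', '+1 555 123 4567', 5551234567.0, None]
cleaned = [keep_digits(v) for v in samples]
print(cleaned)  # ['5551234567', '15551234567', '5551234567', None]
```

It should plug into the answer as lists_combined['CustomerVoicePhone'].map(keep_digits).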
In terms of performance, we can compare it with the other answer in the following code, and see that it's a bit faster for a large dataframe (100k phone numbers):
import random
import pandas as pd

def gen_phone():
    first = str(random.randint(100, 999))
    second = str(random.randint(1, 888)).zfill(3)
    last = str(random.randint(1, 9998)).zfill(4)
    # Draw again if the last four digits are all the same digit
    while last in ['1111', '2222', '3333', '4444', '5555', '6666', '7777', '8888']:
        last = str(random.randint(1, 9998)).zfill(4)
    return '{}-{}-{}'.format(first, second, last)

# Build the frame in one go; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame({'p': [gen_phone() for _ in range(100000)]})
def method1():
    regex = r'\)|\(|-|\+|\s'  # or regex = r'[()+\-\s]' using a character class
    # regex=True is needed from pandas 2.0 on, where the default is a literal match
    df['p_1'] = (df['p'].str.replace(regex, '', regex=True)
                 .fillna(df['p']))
%time method1()
# Wall time: 166 ms
def method2():
    df['p_2'] = (df.p.map(lambda x: ''.join(filter(str.isdigit, x))))
%time method2()
# Wall time: 151 ms
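%time is an IPython magic, so the comparison above only runs in a notebook or IPython shell. Here is a rough sketch of the same comparison as a plain script using the standard timeit module. The function names with_regex and with_filter, the 10,000-row sample size and the repeat counts are arbitrary choices for illustration, and the timings will vary by machine and pandas version.

```python
import random
import timeit

import pandas as pd

random.seed(0)
# Made-up sample, smaller than the 100k rows above to keep the run short
phones = pd.Series([
    '{}-{:03d}-{:04d}'.format(random.randint(100, 999),
                              random.randint(0, 999),
                              random.randint(0, 9999))
    for _ in range(10000)
])

def with_regex(s):
    # Strip parentheses, plus, hyphen and whitespace with a character class
    return s.str.replace(r'[()+\-\s]', '', regex=True)

def with_filter(s):
    # Keep only the digit characters of each string
    return s.map(lambda x: ''.join(filter(str.isdigit, x)))

# Both approaches should agree before their timings are worth comparing
assert with_regex(phones).tolist() == with_filter(phones).tolist()

for func in (with_regex, with_filter):
    # Best of 3 runs, each run calling the function 5 times
    seconds = min(timeit.repeat(lambda: func(phones), number=5, repeat=3)) / 5
    print('{}: {:.1f} ms per call'.format(func.__name__, seconds * 1000))
```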
Upvotes: 1
Reputation: 153460
Let's use the .str accessor with replace and a regex:
regex = r'\)|\(|-|\+|\s'  # or regex = r'[()+\-\s]' using a character class
# regex=True is needed from pandas 2.0 on, where the default is a literal match
lists_combined['CustomerVoicePhone'] = (lists_combined['CustomerVoicePhone'].str.replace(regex, '', regex=True)
                                        .fillna(lists_combined['CustomerVoicePhone']))
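As a quick check of what the regex approach does, here is a small self-contained sketch on made-up phone numbers. It assumes a recent pandas, where str.replace treats the pattern as a literal string unless regex=True is passed, and it uses the character-class form of the pattern.

```python
import pandas as pd

# Made-up sample data, including a missing value
phones = pd.Series(['(555) 123-4567', '+1 555-123-4567', None])

# Character class: parentheses, plus, hyphen and any whitespace
pattern = r'[()+\-\s]'
cleaned = phones.str.replace(pattern, '', regex=True)

# The missing value stays missing; the strings are reduced to digits
print(cleaned.dropna().tolist())  # ['5551234567', '15551234567']
```

As far as I know, non-string cells in an object column (for example numbers read from Excel as floats) come back as NaN from .str.replace, which is what the fillna with the original column in the answer guards against.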
Upvotes: 0
Reputation: 1
First, you should avoid the chain of replace calls, which hurts the readability of your code. You could pass a list to the replace function with the elements you want to replace by an empty string...
But the main problem with your code is that it should be df['column'].str.replace(), which replaces inside each string, and not just df['column'].replace(), which by default only matches whole cell values.
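To make the difference concrete, here is a minimal sketch on a made-up two-row Series. It shows why the original line ran without error but changed nothing, and that replace can also work on substrings when regex=True is passed.

```python
import pandas as pd

# Made-up data: one phone number and one cell that is exactly '-'
s = pd.Series(['555-1234', '-'])

# Series.replace matches whole cell values: only the cell equal to '-' changes
whole = s.replace('-', '')

# Series.str.replace works inside each string
inside = s.str.replace('-', '', regex=False)

# Series.replace does substring replacement when the pattern is treated as a regex
via_regex = s.replace('-', '', regex=True)

print(whole.tolist())      # ['555-1234', '']
print(inside.tolist())     # ['5551234', '']
print(via_regex.tolist())  # ['5551234', '']
```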
Cheers
Upvotes: 0