Reputation: 1376
I have a csv of a single column containing supposedly phone numbers. However, there are several cells containing the wrong number of digits, alphabet and symbols, etc. I just want to keep every row whose cell contains exactly ten numerical digits and delete all other rows.
I also have a file that includes a weird non-ascii character in the middle of every cell in the name column (after the first name). Is there a simple way to discard every non-ascii character in a file? I just want to get rid of it, though replacing it with a space would be fine, too.
I already have pandas imported in the code I'm running on my csvs, so if that helps, it's available.
Upvotes: 1
Views: 1197
Reputation: 103884
First, let's generate some random phone numbers -- most will be good, some bad:
import random
li=[]
for i in range(50):
li.append('{}-{}-{}'.format(random.randrange(1,999), random.randrange(1,999), random.randrange(100,9999)))
Then filter those:
import re
new_li=[e for e in li if re.match(r'\d\d\d-\d\d\d-\d\d\d\d', e)]
print new_li
print set(li)-set(new_li) # show the ones removed...
If you want to do it Pandas, you can use the same regex and use .apply
to the DataFrame:
df=pd.DataFrame(li, columns=['Phone'])
pat=re.compile(r'^\d{3}-\d{3}-\d{4}$')
print df
print df[df['Phone'].apply(lambda s: bool(pat.search(s)))]
Upvotes: 3