kcEmenike

Reputation: 172

Validate strings using regex in pandas

I need a bit of help.

I'm pretty new to Python (I use version 3.0 bundled with Anaconda) and I want to use a regex to return only the values that match a pattern (say \d{11} for 11 digits). I'm building the list with pandas:

import pandas as pd

df = pd.DataFrame(columns=['phoneNumber','count'], data=[
    ['08034303939',11],
    ['08034382919',11],
    ['0802329292',10],
    ['09039292921',11]])

When I return all the items using

for row in df.iterrows(): # dataframe.iterrows() returns tuple
    print(row[1][0])

it returns all items without regex validation, but when I try to validate with this

import re

for row in df.iterrows(): # dataframe.iterrows() returns tuple
    print(re.compile(r"\d{11}").search(row[1][0]).group())

it raises an AttributeError, since search() returns None for values that don't match.

How can I work around this, or is there an easier way?

Upvotes: 2

Views: 4575

Answers (1)

cs95

Reputation: 402814

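If you want to validate, you can use str.match, which returns a boolean mask; astype(bool) makes sure the result has bool dtype. Because str.match anchors only at the start of the string, the pattern ends with $ so that strings longer than 11 digits are rejected:

x = df['phoneNumber'].str.match(r'\d{11}$').astype(bool)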
x

0     True
1     True
2    False
3     True
Name: phoneNumber, dtype: bool
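
Two edge cases may be worth checking here: str.match with an unanchored pattern such as \d{11} would also accept a 12-digit value, and missing values come back as NaN, which astype(bool) turns into True. The following is a minimal sketch of one alternative, assuming pandas 1.1 or later (where Series.str.fullmatch is available). The extra rows and the names numbers, mask and valid are illustrative and not part of the question.

```python
import pandas as pd

# Illustrative data (an assumption): two of the question's numbers,
# plus a 12-digit value and a missing value.
numbers = pd.Series(
    ['08034303939', '0802329292', '080343039391', None],
    name='phoneNumber',
)

# fullmatch requires the whole string to match the pattern;
# na=False maps missing values to False instead of NaN.
mask = numbers.str.fullmatch(r'\d{11}', na=False)

# Boolean indexing keeps only the entries where the mask is True.
valid = numbers[mask]
print(valid.tolist())
```

With na=False the mask is already boolean, so no astype(bool) step is needed in this variant.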

You can use boolean indexing to return only rows with valid phone numbers.

df[x]

   phoneNumber  count
0  08034303939     11
1  08034382919     11
3  09039292921     11
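
If you would rather keep the explicit loop with the re module, the AttributeError can be avoided by checking the match object before calling group(). This is a rough sketch rather than the recommended approach; the names pattern and valid_numbers are made up for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame(columns=['phoneNumber', 'count'], data=[
    ['08034303939', 11],
    ['08034382919', 11],
    ['0802329292', 10],
    ['09039292921', 11]])

# Compile the pattern once, outside the loop.
pattern = re.compile(r'\d{11}')

valid_numbers = []
for number in df['phoneNumber']:
    # fullmatch returns None when the whole string does not match,
    # so test the result before calling .group().
    match = pattern.fullmatch(number)
    if match is not None:
        valid_numbers.append(match.group())

print(valid_numbers)
```

Iterating over the column directly avoids unpacking the tuples that iterrows() yields. The vectorised str methods are usually the shorter option on larger frames.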

Upvotes: 5
