Jason

Reputation: 2894

Replacing pandas column with a subset of itself through regex

I have data streaming in the following format:

from io import StringIO
data ="""\
ANI/IP
sip:5554447777@10.94.2.15
sip:10.66.7.34@6665554444
sip:3337775555@10.94.2.11
"""
import pandas as pd
df = pd.read_table(StringIO(data), sep=r'\s+', dtype='str')

What I would like to do is replace the column content with just the phone number part of the string above. I tried the suggestions from this thread like so:

df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '').astype('str')
print(df)

However, this results in:

            ANI/IP
0  sip:@10.94.2.15
1  sip:@10.66.7.34
2  sip:@10.94.2.11

I need to keep the phone numbers instead. How do I get the following result?

  ANI/IP
0 5554447777
1 6665554444
2 3337775555

Upvotes: 1

Views: 595

Answers (2)

Zero

Reputation: 76947

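You could use Series.str.extract (pandas.core.strings.StringMethods.extract) to pull out the 10-digit phone number. On recent pandas versions, pass expand=False to get a Series back:

In [10]: df['ANI/IP'].str.extract(r"(\d{10})", expand=False)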
Out[10]:
0    5554447777
1    6665554444
2    3337775555
Name: ANI/IP, dtype: object

Upvotes: 1

Andy Hayden

Reputation: 375685

The regex \d{10} matches a substring of exactly 10 consecutive digits.
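
To make the matching behaviour concrete, here is a minimal sketch using Python's re module. The sample strings are assumptions modeled on the question's data, and the third one (an 11-digit run) is made up to show an edge case: plain \d{10} will happily match the first ten digits of a longer run, whereas the lookaround variant only accepts runs that are exactly ten digits long.

```python
import re

# Assumed sample strings; the last one is a hypothetical 11-digit case.
samples = [
    "sip:5554447777@10.94.2.15",
    "sip:10.66.7.34@6665554444",
    "sip:15554447777@10.94.2.15",
]

ten_digits = re.compile(r"\d{10}")
# Lookarounds reject a match that is part of a longer run of digits.
exactly_ten = re.compile(r"(?<!\d)\d{10}(?!\d)")


def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


loose = [first_match(ten_digits, s) for s in samples]
strict = [first_match(exactly_ten, s) for s in samples]
print(loose)   # ['5554447777', '6665554444', '1555444777']
print(strict)  # ['5554447777', '6665554444', None]
```

For the data in the question both patterns give the same result, so the stricter form only matters if longer digit runs can appear.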

df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '').astype('str')

This removes the numbers instead of keeping them!
Note: You don't need astype('str') (the values are already strings, and pandas stores them under the object dtype rather than a str dtype).
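
A small sketch of why the replace call strips the numbers, assuming sample values modeled on the question's data. regex=True is passed explicitly here because the default for that parameter is not the same across pandas versions.

```python
import pandas as pd

# Assumed sample values, modeled on the question's data.
s = pd.Series(
    ["sip:5554447777@10.94.2.15", "sip:10.66.7.34@6665554444"],
    name="ANI/IP",
)

# Every 10-digit match is substituted with the empty string,
# so the phone number is exactly the part that disappears.
removed = s.str.replace(r"\d{10}", "", regex=True)
print(removed.tolist())  # ['sip:@10.94.2.15', 'sip:10.66.7.34@']
```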

You want to extract these phone numbers:

In [11]: df["ANI/IP"].str.extract(r'(\d{10})', expand=False)  # before overwriting!
Out[11]:
0    5554447777
1    6665554444
2    3337775555
Name: ANI/IP, dtype: object

Set this as another column and you're away:

In [12]: df["phone_number"] = df["ANI/IP"].str.extract(r'(\d{10})')
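
If you would rather overwrite the original column, as the question asks, the same call can be assigned straight back. This is a minimal end-to-end sketch; the three input strings are assumed values modeled on the question's data, and expand=False is used so that a single capture group yields a Series.

```python
import pandas as pd

# Assumed input, modeled on the question's data.
df = pd.DataFrame({"ANI/IP": [
    "sip:5554447777@10.94.2.15",
    "sip:10.66.7.34@6665554444",
    "sip:3337775555@10.94.2.11",
]})

# Keep only the first run of 10 digits found in each value.
df["ANI/IP"] = df["ANI/IP"].str.extract(r"(\d{10})", expand=False)
print(df)
```

Rows without a 10-digit match come back as NaN, so it may be worth checking for missing values before relying on the column.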

Upvotes: 4
