Replacing pandas column with a subset of itself through regex

Question

I have data streaming in the following format:

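from io import StringIO  # Python 3; Python 2 used `from StringIO import StringIO`
import pandas as pd

data = """\
ANI/IP
sip:5554447777@10.94.2.15
sip:10.66.7.34@6665554444
sip:3337775555@10.94.2.11
"""
# raw string so that \s is passed to the regex engine unchanged
df = pd.read_table(StringIO(data), sep=r'\s+', dtype='str')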

What I would like to do is replace the column content with just the phone number part of the string above. I tried the suggestions from this thread like so:

df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '', regex=True).astype('str')
print(df)

However, this results in:

            ANI/IP
0  sip:@10.94.2.15
1  sip:10.66.7.34@
2  sip:@10.94.2.11

I need the phone numbers instead. How do I achieve the following result?

  ANI/IP
0 5554447777
1 6665554444
2 3337775555

Andy Hayden · Accepted Answer

The regex \d{10} matches a substring of 10 consecutive digits.

df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '', regex=True).astype('str')

This replaces every match with an empty string, so it removes the numbers!
Note: You don't need astype('str') here. The column already holds strings, which pandas stores with object dtype in this example.
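To make the difference concrete, here is a minimal sketch using only Python's built-in re module on the sample strings from the question (no pandas involved). The names samples, removed and kept are just illustrative. Substituting the match deletes the digits, while searching for the match returns them.

```python
import re

samples = [
    "sip:5554447777@10.94.2.15",
    "sip:10.66.7.34@6665554444",
    "sip:3337775555@10.94.2.11",
]

pattern = re.compile(r"\d{10}")

# Substituting the match with '' deletes the phone number,
# which is what the str.replace call does.
removed = [pattern.sub("", s) for s in samples]

# Searching for the match and keeping it returns the phone number.
kept = [pattern.search(s).group(0) for s in samples]

print(removed)
print(kept)
```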

You want to extract these phone numbers:

In [11]: df["ANI/IP"].str.extract(r'(\d{10})', expand=False)  # before overwriting! expand=False returns a Series
Out[11]:
0    5554447777
1    6665554444
2    3337775555
Name: ANI/IP, dtype: object

Set this as another column and you're away:

In [12]: df["phone_number"] = df["ANI/IP"].str.extract(r'(\d{10})', expand=False)
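If you would rather overwrite the original column, as the question asks, the same call can be assigned back to "ANI/IP". Below is a self-contained sketch, assuming Python 3 and a recent pandas in which str.extract accepts the expand keyword. The intermediate name phones is only for illustration.

```python
from io import StringIO

import pandas as pd

data = """\
ANI/IP
sip:5554447777@10.94.2.15
sip:10.66.7.34@6665554444
sip:3337775555@10.94.2.11
"""

# Read the single whitespace-delimited column as strings.
df = pd.read_csv(StringIO(data), sep=r"\s+", dtype=str)

# With one capture group and expand=False, extract returns a Series.
phones = df["ANI/IP"].str.extract(r"(\d{10})", expand=False)

# Overwrite the original column with the extracted digits.
df["ANI/IP"] = phones
print(df)
```

Rows that contain no run of 10 digits come back as NaN rather than raising an error, so it may be worth checking phones.isna() before overwriting.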
