Reputation: 538
I have a dataset with two columns:
Index Text
1 *some text* address13/b srs mall, indirapuram,sann-444000 *some text*
2 *some text*
3 *some text* contactus 12J 1st floor, jajan,totl-996633 *some text*
4 ..........
5 ........
I want a dataframe with a new column, "location", that holds only the part of "Text" following the keyword "address" or "contactus", up to and including the 6-digit number, and "NA" where nothing matches. The output I want looks something like this:
Index location
1 13/b srs mall, indirapuram,sann-444000
2 NA
3 12J 1st floor, jajan,totl-996633
4 NA
Upvotes: 0
Views: 403
Reputation: 402263
Use str.extract:
df['location'] = df.Text.str.extract(r'(?:address|contactus)(.*?\d{6})', expand=False)
df.drop(columns='Text')
Index location
0 1 13/b srs mall, indirapuram,sann-444000
1 2 NaN
2 3 12J 1st floor, jajan,totl-996633
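The question asks for the literal string "NA" rather than NaN, and the captured text can carry a leading space (as in the "contactus 12J ..." row). A minimal sketch of one way to handle both, assuming sample rows in which "foo" and "bar" stand in for the surrounding *some text*:

```python
import pandas as pd

# Sample data modelled on the question; 'foo'/'bar' are placeholders.
df = pd.DataFrame({
    'Index': [1, 2, 3],
    'Text': [
        'foo address13/b srs mall, indirapuram,sann-444000 bar',
        'foo bar',
        'foo contactus 12J 1st floor, jajan,totl-996633 bar',
    ],
})

df['location'] = (
    df['Text']
    .str.extract(r'(?:address|contactus)(.*?\d{6})', expand=False)
    .str.strip()      # drop whitespace left between keyword and address
    .fillna('NA')     # rows without a match become the string 'NA'
)
result = df.drop(columns='Text')
```

Note that after fillna('NA') the column holds ordinary strings, so isna() will no longer flag the unmatched rows.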
As a helpful aside, when you have multiple items to check for, put them in a list and join them with str.join:
terms = ['address', 'contactus', ...]
# Double the braces in \d{{6}} so that str.format leaves the quantifier intact
pattern = r'(?:{})(.*?\d{{6}})'.format('|'.join(terms))
df['location'] = df.Text.str.extract(pattern, expand=False)
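When the pattern is assembled with str.format, the quantifier braces have to be doubled, and it is probably worth passing each keyword through re.escape in case one contains regex metacharacters. A rough sketch using only the standard re module; the helper name extract_location and the sample strings are made up for illustration:

```python
import re

# Hypothetical keyword list; extend as needed.
terms = ['address', 'contactus']

# re.escape guards against keywords containing regex metacharacters.
# {{6}} is doubled so that str.format emits a literal {6}.
pattern = r'(?:{})(.*?\d{{6}})'.format('|'.join(map(re.escape, terms)))

def extract_location(text):
    """Return the text after a keyword up to the first 6-digit run, else None."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

samples = [
    'foo address13/b srs mall, indirapuram,sann-444000 bar',
    'foo bar',
    'foo contactus 12J 1st floor, jajan,totl-996633 bar',
]
locations = [extract_location(s) for s in samples]
```

The resulting pattern string is the same one used in the answer, so it can be passed to str.extract unchanged.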
Regex Details
(?: # non-capturing group
address # "address"
| # regex OR
contactus # "contactus"
)
(.*? # non-greedy match-all
\d{6} # 6 digit zipcode
)
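The commented layout above can be kept in the code itself with the re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. A small sketch with the standard re module; the sample string is invented:

```python
import re

location_re = re.compile(r'''
    (?:             # non-capturing group
        address     # "address"
        |           # regex OR
        contactus   # "contactus"
    )
    (.*?            # non-greedy match-all
        \d{6}       # 6 consecutive digits
    )
''', re.VERBOSE)

m = location_re.search('x address13/b srs mall, indirapuram,sann-444000 y')
captured = m.group(1)
```

Series.str.extract also takes a flags argument, so the same verbose pattern string should work there with flags=re.VERBOSE.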
Upvotes: 1