Reputation: 4343
I've got a bunch of addresses like so:
df['street'] =
5311 Whitsett Ave 34
355 Sawyer St
607 Hampshire Rd #358
342 Old Hwy 1
267 W Juniper Dr 402
What I want to do is to remove those numbers at the end of the street part of the addresses to get:
df['street'] =
5311 Whitsett Ave
355 Sawyer St
607 Hampshire Rd
342 Old Hwy 1
267 W Juniper Dr
I have my regular expression like this:
df['street'] = df.street.str.replace(r"""\s(?:dr|ave|rd)[^a-zA-Z]\D*\d+$""", '', case=False)
which gives me this:
df['street'] =
5311 Whitsett
355 Sawyer St
607 Hampshire
342 Old Hwy 1
267 W Juniper
It dropped the words 'Ave', 'Rd' and 'Dr' from my original street addresses. Is there a way to keep part of the matched pattern (in my case 'Ave', 'Rd', 'Dr') and replace only the rest?
EDIT:
Notice the address 342 Old Hwy 1. I do not want to take out the number in such cases. That's why I specified the patterns ('Ave', 'Rd', 'Dr', etc.) to have better control over which addresses get changed.
Upvotes: 0
Views: 857
Reputation: 5658
import re

df_street = '''
5311 Whitsett Ave 34
355 Sawyer St
607 Hampshire Rd #358
342 Old Hwy 1
267 W Juniper Dr 402
'''
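# Trailing digits are removed only when they follow one of (Ave, Rd, Dr) and whitespace.
# They may be prefixed by a '#' and followed by optional spaces before the newline.
# The captured suffix is put back with \1, so only the number is dropped.
# Flags must be passed by keyword: the fourth positional argument of re.sub is count.
df_street = re.sub(r'(Ave|Rd|Dr)\s+#?\d+\s*\n', r'\1\n', df_street, flags=re.MULTILINE|re.IGNORECASE)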
print(df_street)
5311 Whitsett Ave
355 Sawyer St
607 Hampshire Rd
342 Old Hwy 1
267 W Juniper Dr
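The same capture-group idea can be applied one address at a time, which is closer to the per-row situation in the question. This is only a minimal sketch: the names `UNIT_SUFFIX` and `strip_unit_number` are made up for illustration, the suffix list is just the three suffixes from the question, and `$` is used instead of `\n` because each address is a separate string.

```python
import re

# Hypothetical pattern name. Group 1 captures the street suffix so it can be
# restored; the trailing unit number (optionally prefixed by '#') is dropped.
UNIT_SUFFIX = re.compile(r"\b(ave|rd|dr)\s+#?\d+\s*$", flags=re.IGNORECASE)


def strip_unit_number(street: str) -> str:
    """Remove a trailing unit number only when it follows Ave, Rd or Dr."""
    return UNIT_SUFFIX.sub(r"\1", street)


streets = [
    "5311 Whitsett Ave 34",
    "355 Sawyer St",
    "607 Hampshire Rd #358",
    "342 Old Hwy 1",
    "267 W Juniper Dr 402",
]

cleaned = [strip_unit_number(s) for s in streets]
for street in cleaned:
    print(street)
```

Assuming `df['street']` holds plain strings, something like `df['street'] = df['street'].map(strip_unit_number)` should apply the same function to the column.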
Upvotes: 1
Reputation: 1852
You should use the following regex:
>>> import re
>>> example_str = "607 Hampshire Rd #358"
>>> re.sub(r"\s*\#?[^\D]+\s*$", r"", example_str)
'607 Hampshire Rd'
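To see how this pattern behaves on the rest of the sample data, here is a quick check over all five addresses from the question. It is a sketch for comparison only; the list and variable names are illustrative. Because the pattern is not tied to a street suffix, it also removes the trailing `1` from `342 Old Hwy 1`, which the question wanted to keep.

```python
import re

# Same regex as above: [^\D] is equivalent to \d, so this strips any
# trailing run of digits (optionally prefixed by '#'), whatever precedes it.
PATTERN = re.compile(r"\s*\#?[^\D]+\s*$")

streets = [
    "5311 Whitsett Ave 34",
    "355 Sawyer St",
    "607 Hampshire Rd #358",
    "342 Old Hwy 1",
    "267 W Juniper Dr 402",
]

results = [PATTERN.sub("", s) for s in streets]
for before, after in zip(streets, results):
    print(f"{before!r} -> {after!r}")
```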
Upvotes: 0