Reputation: 384
I have a column with 25k + records containing similar information in the example data frame.
Example data frame:
df = pd.DataFrame({'Address': ['Corner Great North Rd & Pipiwai Rd RD 6 Whangarei 2104',
'2305/142 Shakespeare Road Takapuna North Shore 0622',
'29 Stilwell Rd Mt Albert Auckland', '10/70 Atkinson Ave Otahuhu Auckland 1062']})
I'm trying to remove the trailing numbers in each record so ideally I would get a column that contains:
Address
--------
Corner Great North Rd & Pipiwai Rd RD 6 Whangarei
2305/142 Shakespeare Road Takapuna North Shore
29 Stilwell Rd Mt Albert Auckland
10/70 Atkinson Ave Otahuhu Auckland
I've tried using regex to remove all the characters at the end of string until it hits the first white space.
pattern = re.sub("(.*\s).*", '\\1', str)
df['Address'] = df.str.replace(pattern, '', regex=True)
This throws a TypeError exception which I think is caused from the numbers in the string. However I believe this code could also delete any trailing words deleting information I want to keep.
My question: Is there a regex pattern that could be applied for the entire column?
Upvotes: 0
Views: 718
Reputation: 13821
Or you can use rstrip()
and remove the trailing digits:
df['Address'] = df['Address'].str.rstrip(' 0123456789') # removed an additional trailing space
Prints:
df
Out[196]:
Address
0 Corner Great North Rd & Pipiwai Rd RD 6 Whanga...
1 2305/142 Shakespeare Road Takapuna North Shore
2 29 Stilwell Rd Mt Albert Auckland
3 10/70 Atkinson Ave Otahuhu Auckland
Upvotes: 1
Reputation: 520968
Using str.replace
:
df['Address'] = df['Address'].str.replace(r'\s+\d+$', '')
The regex pattern \s+\d+$
will match one or more whitespace characters followed by a number trailing to the end of the address.
Upvotes: 2