Reputation: 318
I have a pandas DataFrame containing addresses. Some are formatted correctly, like 481 Rogers Rd York ON. Others are missing a space between the city quadrant and the city name, for example 101 9 Ave SWCalgary AB, or possibly even 101 9 Ave SCalgary AB, where SW refers to southwest and S to south.
I'm trying to find a regex that will add a space between the second and third capital letters if they are followed by lowercase letters, or, if there are only two capitals followed by lowercase letters, between the first and second.
So far, I've found that ([A-Z]{2,3}[a-z]) matches these cases correctly, but I can't figure out how to look back into the match and substitute at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:], but I can't figure out how to do this.
I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB') returns the last part of the string, and I could use a lookahead regex to find the start and then join the two, but this seems very inefficient.
Thanks
Upvotes: 1
Views: 1084
Reputation: 627082
You may use
df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
See this regex demo
Details
\b - a word boundary
([A-Z]{1,2}) - Capturing group 1 (referred to as \1 in the replacement pattern): one or two uppercase letters
([A-Z][a-z]) - Capturing group 2 (referred to as \2 in the replacement pattern): an uppercase letter followed by a lowercase one
If you want to match city quadrants specifically, you can use a more specific regex:
df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2', regex=True)
See this regex demo. Here, [NS][EW]|[NESW] matches N or S followed by E or W, or a single N, E, S or W.
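To make the difference between the two patterns concrete, here is a minimal sketch that uses Python's re module instead of pandas. The POBox sample string is a made-up input for illustration and does not come from the question's data.

```python
import re

# Generic pattern: any one or two capitals glued to a capitalized word.
GENERIC = re.compile(r'\b([A-Z]{1,2})([A-Z][a-z])')
# Quadrant pattern: only N, E, S, W, NE, NW, SE or SW are split off.
QUADRANT = re.compile(r'\b([NS][EW]|[NESW])([A-Z][a-z])')

samples = [
    '101 9 Ave SWCalgary AB',
    '101 9 Ave SCalgary AB',
    '12 Main St POBox 5',  # hypothetical input, not from the question
]

generic_fixed = [GENERIC.sub(r'\1 \2', s) for s in samples]
quadrant_fixed = [QUADRANT.sub(r'\1 \2', s) for s in samples]

for before, g, q in zip(samples, generic_fixed, quadrant_fixed):
    print(f'{before!r}: generic -> {g!r}, quadrant -> {q!r}')
```

Both patterns repair the two Calgary addresses. Only the generic one also splits POBox into PO Box, which may or may not be what you want for your data.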
Pandas demo:
>>> import pandas as pd
>>> df = pd.DataFrame({'Test': ['481 Rogers Rd York ON',
...                             '101 9 Ave SWCalgary AB',
...                             '101 9 Ave SCalgary AB']})
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
0      481 Rogers Rd York ON
1    101 9 Ave SW Calgary AB
2     101 9 Ave S Calgary AB
Name: Test, dtype: object
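The index-based split mentioned in the question (slicing the match at [-2:]) is also possible, because re.sub accepts a function as the replacement. A minimal sketch that reuses the question's own pattern; the helper name split_last_two is made up for illustration:

```python
import re

# The pattern from the question: two or three capitals, then a lowercase letter.
PATTERN = re.compile(r'[A-Z]{2,3}[a-z]')

def split_last_two(match):
    """Insert a space before the last two characters of the matched text."""
    text = match.group(0)
    return text[:-2] + ' ' + text[-2:]

addresses = [
    '481 Rogers Rd York ON',
    '101 9 Ave SWCalgary AB',
    '101 9 Ave SCalgary AB',
]

fixed = [PATTERN.sub(split_last_two, a) for a in addresses]
print(fixed)
```

As far as I know, Series.str.replace also accepts a callable replacement when regex=True, so the same function should work in the pandas call, but the two-group pattern above is simpler.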
Upvotes: 1
Reputation: 370989
You can use ([A-Z]{1,2})(?=[A-Z][a-z]) to capture the first (or first and second) capital letters, followed by a lookahead for a capital letter and a lowercase letter. Then replace the match with the first group and a space, where address is the input string:
re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', address)
https://regex101.com/r/TcB4Ph/1
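A short runnable sketch of the lookahead approach, applied to the three addresses from the question. The function name add_missing_space is an assumption for illustration only.

```python
import re

# One or two capitals, captured, that are immediately followed by
# a capital and a lowercase letter (checked by lookahead, not consumed).
LOOKAHEAD = re.compile(r'([A-Z]{1,2})(?=[A-Z][a-z])')

def add_missing_space(address):
    """Append a space to the captured capitals; the rest is left untouched."""
    return LOOKAHEAD.sub(r'\1 ', address)

addresses = [
    '481 Rogers Rd York ON',
    '101 9 Ave SWCalgary AB',
    '101 9 Ave SCalgary AB',
]

results = [add_missing_space(a) for a in addresses]
print(results)
```

Because the lookahead does not consume the city name, only the quadrant letters are part of the match, so the replacement needs just one group.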
Upvotes: 0