Insert space after the second or third capital letter python

Question

I have a pandas dataframe containing addresses. Some are formatted correctly like 481 Rogers Rd York ON. Others have a space missing between the city quandrant and the city name, for example: 101 9 Ave SWCalgary AB or even possibly: 101 9 Ave SCalgary AB, where SW refers to south west and S to south.

I'm trying to find a regex that will add a space between second and third capital letters if they are followed by lowercase letters, or if there are only 2 capitals followed by lower case, add a space between the first and second.

So far, I've found that ([A-Z]{2,3}[a-z]) will match the situation correctly, but I can't figure out how to look back into it and sub at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:] but I can't figure out how to do this.

I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB') will return the last part of the string and I could use a look forward regex to find the start and then join them but this seems very inefficient.

Thanks

Wiktor Stribiżew · Accepted Answer

You may use

df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2')

See this regex demo

Details

\b - a word boundary
([A-Z]{1,2}) - Capturing group 1 (later referred with \1 from the replacement pattern): one or two uppercase letters
([A-Z][a-z]) - Capturing group 2 (later referred with \2 from the replacement pattern): an uppercase letter + a lowercase one.

If you want to specifically match city quadrants, you may use a bit more specific regex:

df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2')

See this regex demo. Here, [NS][EW]|[NESW] matches N or S that are followed with E or W, or a single N, E, S or W.

Pandas demo:

import pandas as pd
df = pd.DataFrame({'Test':['481 Rogers Rd York ON', 
'101 9 Ave SWCalgary AB',
'101 9 Ave SCalgary AB']})
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2')
0      481 Rogers Rd York ON
1    101 9 Ave SW Calgary AB
2     101 9 Ave S Calgary AB
Name: Test, dtype: object

Insert space after the second or third capital letter python

Answers (2)

Related Questions