glpsx

Reputation: 679

RegEx to split address into three distinct Series [Part 2]

This is a continuation of my previous post on using regular expressions to split a pandas Series containing addresses into three distinct fields (the street, the number, and the box).

My initial example was the following:

import pandas as pd
import numpy as np

df = pd.DataFrame({'cus_name' : ['James', 'Mary', 'David', 'Linda', 'George', 'Jennifer', 'John', 'Maria', 'Charles', 'Helen'],
                   'address' : ['Main St 59', 'Yellow Av 11 b.F1', 'Terrazzo Way 100-102', np.nan, 'Hamilton St 159 b.A/B', np.nan, 'Henry St 7 D', 'Mc-Kenzie Av 40P b.1', 'Neptune Av 14 15 b.G', np.nan ], 
                   'postal_code' : [1410, 1210, 1020, np.nan, 1310, np.nan, 1080, 1190, 1040, np.nan], 
                  })

print(df)

   cus_name                address  postal_code
0     James             Main St 59       1410.0
1      Mary      Yellow Av 11 b.F1       1210.0
2     David   Terrazzo Way 100-102       1020.0
3     Linda                    NaN          NaN
4    George  Hamilton St 159 b.A/B       1310.0
5  Jennifer                    NaN          NaN
6      John           Henry St 7 D       1080.0
7     Maria   Mc-Kenzie Av 40P b.1       1190.0
8   Charles   Neptune Av 14 15 b.G       1040.0
9     Helen                    NaN          NaN

Using the regex pattern from the solution given by RomanPerekhrest, the address Series nicely splits into the 3 desired fields.

pattern = r'(\D+)\s+(\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?'
print(df['address'].str.extract(pattern, expand = True))

              0        1    2
0       Main St       59  NaN
1     Yellow Av       11   F1
2  Terrazzo Way  100-102  NaN
3           NaN      NaN  NaN
4   Hamilton St      159  A/B
5           NaN      NaN  NaN
6      Henry St      7 D  NaN
7  Mc-Kenzie Av      40P    1
8    Neptune Av    14 15    G
9           NaN      NaN  NaN

Unfortunately, in my previous post, I didn't account for the case where the address only contains the street information (e.g. Place de la Monnaie).

In this case, the above regex pattern doesn't work anymore. See this regex101 link.
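
As a quick way to reproduce the failure, here is a minimal sketch that uses Python's built-in `re` module rather than pandas. The `street_only` variable name is only for illustration.

```python
import re

# Pattern from the earlier solution, written as a raw string.
pattern = r'(\D+)\s+(\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?'

street_only = 'Place de la Monnaie'  # illustrative street-only address

# The pattern requires at least one digit after the street,
# so it finds no match in a street-only address.
match = re.search(pattern, street_only)
print(match)  # None
```

With `str.extract`, a row that doesn't match comes back as NaN in every column, so the street name is lost as well.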

I spent half an hour trying to modify the regex pattern to account for this case, without success. One thing I noticed is that even though the number field can contain word characters, it always starts with a digit when it is present.

Any additional help would be appreciated.

Upvotes: 0

Views: 69

Answers (1)

Sonu Sharma

Reputation: 334

This pattern can help:

(\D+)\s((\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?)*
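
One caveat worth checking before using the pattern above: on a street-only address such as `Place de la Monnaie`, the mandatory `\s` after `(\D+)` forces the street group to stop before the last word. The nested groups also give four capture groups, so `str.extract` would return four columns instead of three. The sketch below shows this with the built-in `re` module and tries a hypothetical alternative. `alt_pattern` and `split_address` are illustrative names, not part of the original answer. It relies on the asker's observation that the number always starts with a digit, and it assumes a box is always introduced by `b.`.

```python
import re

# Pattern from the answer above.
answer_pattern = r'(\D+)\s((\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?)*'

# Hypothetical alternative (an assumption, not the original answer):
# a lazy street, an optional number that must start with a digit,
# and an optional "b.<box>" suffix, anchored at both ends.
alt_pattern = (
    r'^(?P<street>\D+?)'
    r'(?:\s+(?P<number>\d.*?))?'
    r'(?:\s+b\.(?P<box>\S+))?$'
)

def split_address(address):
    """Return (street, number, box); missing parts are None."""
    m = re.match(alt_pattern, address)
    return m.group('street', 'number', 'box') if m else (None, None, None)

# The answer's pattern drops the last word of a street-only address.
print(re.search(answer_pattern, 'Place de la Monnaie').group(1))  # Place de la

for address in ['Main St 59', 'Yellow Av 11 b.F1', 'Henry St 7 D',
                'Neptune Av 14 15 b.G', 'Place de la Monnaie']:
    print(split_address(address))
```

The same pattern could be passed to `df['address'].str.extract(alt_pattern)`. Because it uses named groups, the resulting columns would be named `street`, `number` and `box`.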

Upvotes: 1
