Reputation: 679
This is a continuation of my previous post on using regular expressions to split a pandas Series containing addresses into three distinct fields (the street, the number, and the box).
My initial example was the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'cus_name' : ['James', 'Mary', 'David', 'Linda', 'George', 'Jennifer', 'John', 'Maria', 'Charles', 'Helen'],
'address' : ['Main St 59', 'Yellow Av 11 b.F1', 'Terrazzo Way 100-102', np.nan, 'Hamilton St 159 b.A/B', np.nan, 'Henry St 7 D', 'Mc-Kenzie Av 40P b.1', 'Neptune Av 14 15 b.G', np.nan ],
'postal_code' : [1410, 1210, 1020, np.nan, 1310, np.nan, 1080, 1190, 1040, np.nan],
})
print(df)
   cus_name                address  postal_code
0     James             Main St 59       1410.0
1      Mary      Yellow Av 11 b.F1       1210.0
2     David   Terrazzo Way 100-102       1020.0
3     Linda                    NaN          NaN
4    George  Hamilton St 159 b.A/B       1310.0
5  Jennifer                    NaN          NaN
6      John           Henry St 7 D       1080.0
7     Maria   Mc-Kenzie Av 40P b.1       1190.0
8   Charles   Neptune Av 14 15 b.G       1040.0
9     Helen                    NaN          NaN
Using the regex pattern from the solution given by RomanPerekhrest, the address
Series nicely splits into the 3 desired fields.
pattern = r'(\D+)\s+(\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?'
print(df['address'].str.extract(pattern, expand = True))
              0        1    2
0       Main St       59  NaN
1     Yellow Av       11   F1
2  Terrazzo Way  100-102  NaN
3           NaN      NaN  NaN
4   Hamilton St      159  A/B
5           NaN      NaN  NaN
6      Henry St      7 D  NaN
7  Mc-Kenzie Av      40P    1
8    Neptune Av    14 15    G
9           NaN      NaN  NaN
Unfortunately, in my previous post, I didn't account for the case where the address contains only the street information (e.g. Place de la Monnaie).
In this case, the above regex pattern no longer works, because it requires at least one digit after the street. See this regex101 link.
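For reference, the failure can be reproduced with Python's built-in re module alone (str.extract returns NaN for rows where the pattern finds no match). This is only a minimal sketch; the two sample strings are taken from the examples above.

```python
import re

# Pattern from RomanPerekhrest's solution, quoted above.
pattern = r'(\D+)\s+(\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?'

# An address with a number still matches and yields three groups.
with_number = re.search(pattern, 'Main St 59')
print(with_number.groups())

# A street-only address has no digit, so the mandatory \d+ cannot match.
street_only = re.search(pattern, 'Place de la Monnaie')
print(street_only)
```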
I tried for half an hour to modify the regex pattern to account for this case, without success. What I did notice is that, even though the number field can contain word characters, it always starts with a digit when it is present.
Any additional help would be appreciated.
Upvotes: 0
Views: 69
Reputation: 334
This pattern can help:
(\D+)\s((\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?)*
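Because (\D+) must still be followed by \s in the pattern above, a street-only address such as Place de la Monnaie seems to lose its last word to backtracking. One possible variant, offered as a sketch rather than a tested drop-in: anchor the pattern with ^ and $, make the street group lazy, and put the number and the box in their own optional groups. The group names (street, number, box) and the helper split_address are made up for this illustration, and the sketch uses the standard re module so it runs without pandas.

```python
import re

# Hypothetical variant (not from the original posts):
#   street : lazy run of non-digits
#   number : optional, must start with a digit (as observed in the question)
#   box    : optional, whatever follows "b."
ADDRESS_RE = re.compile(
    r'^(?P<street>\D+?)'
    r'(?:\s+(?P<number>\d+[\s-]?(?!b)\w*))?'
    r'(?:\s+b\.(?P<box>\S+))?$'
)

def split_address(address):
    """Return (street, number, box); parts that are absent come back as None."""
    match = ADDRESS_RE.match(address)
    return match.groups() if match else (None, None, None)

samples = [
    'Main St 59',
    'Yellow Av 11 b.F1',
    'Henry St 7 D',
    'Neptune Av 14 15 b.G',
    'Place de la Monnaie',
]
for sample in samples:
    print(sample, '->', split_address(sample))
```

The same pattern string (ADDRESS_RE.pattern) should also be usable with df['address'].str.extract(...), in which case the named groups become the column names.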
Upvotes: 1