RegEx to split address into three distinct Series [Part 1]

Question

I'm experimenting/learning Python with a data set containing customers information.

The DataFrame structure is the following (these are made up records):

import pandas as pd
import numpy as np

df = pd.DataFrame({'cus_name' : ['James', 'Mary', 'David', 'Linda', 'George', 'Jennifer', 'John', 'Maria', 'Charles', 'Helen'],
                   'address' : ['Main St 59', 'Yellow Av 11 b.F1', 'Terrazzo Way 100-102', np.nan, 'Hamilton St 159 b.A/B', np.nan, 'Henry St 7 D', 'Mc-Kenzie Av 40P b.1', 'Neptune Av 14 15 b.G', np.nan ], 
                   'postal_code' : [1410, 1210, 1020, np.nan, 1310, np.nan, 1080, 1190, 1040, np.nan], 
                  })

print(df)

   cus_name                address  postal_code
0     James             Main St 59       1410.0
1      Mary      Yellow Av 11 b.F1       1210.0
2     David   Terrazzo Way 100-102       1020.0
3     Linda                    NaN          NaN
4    George  Hamilton St 159 b.A/B       1310.0
5  Jennifer                    NaN          NaN
6      John           Henry St 7 D       1080.0
7     Maria   Mc-Kenzie Av 40P b.1       1190.0
8   Charles   Neptune Av 14 15 b.G       1040.0
9     Helen                    NaN          NaN

I'm particularly interested in the address Series. Specifically, my goal is to "split" the information of the street, number, and box into three distinct Series.

For instance, after the transformation, the first and seventh record/row should look like this:

| cus_name | street       | number | box | postal_code |
|----------|--------------|--------|-----|-------------|
| James    | Main St      | 59     | NaN | 1410        |
| Maria    | Mc-Kenzie Av | 40P    | 1   | 1190.0      |

At first, I had no idea how to tackle this problem. After doing some research here, I found some interesting related posts that use regular expressions.

Since I'm no expert in Python (nor regular expressions), I thought I could start by identifying the pattern in the address Series. In fact, each address has the following pattern:

The street part which is located at the beginning of the string. It is composed of one or more words separated by a white-space character or a dash (e.g. Mc-Kenzie Av);
The number part which is located in the middle of the string. It is composed of one or more alpha-numeric words separated by a white-space character or a dash (e.g. 100-102, 7 D);
The box part which is located at the end of the string. It always immediately follows the b.characters and is composed of one word containing alpha-numeric characters and possibly some special characters (e.g. A/B, F1).

I'm asking for help to achieve my desired goal using regular expressions (if regex is the solution).

RomanPerekhrest · Accepted Answer

Another regex approach:

In [913]: df[['street', 'number', 'box']] = df.address.str.extract(r'(\D+)\s+(\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?', expand=True)

In [914]: df
Out[914]: 
   cus_name                address  postal_code        street   number  box
0     James             Main St 59       1410.0       Main St       59  NaN
1      Mary      Yellow Av 11 b.F1       1210.0     Yellow Av       11   F1
2     David   Terrazzo Way 100-102       1020.0  Terrazzo Way  100-102  NaN
3     Linda                    NaN          NaN           NaN      NaN  NaN
4    George  Hamilton St 159 b.A/B       1310.0   Hamilton St      159  A/B
5  Jennifer                    NaN          NaN           NaN      NaN  NaN
6      John           Henry St 7 D       1080.0      Henry St      7 D  NaN
7     Maria   Mc-Kenzie Av 40P b.1       1190.0  Mc-Kenzie Av      40P    1
8   Charles   Neptune Av 14 15 b.G       1040.0    Neptune Av    14 15    G
9     Helen                    NaN          NaN           NaN      NaN  NaN

RegEx to split address into three distinct Series [Part 1]

Answers (2)

Related Questions