glpsx
glpsx

Reputation: 679

RegEx to split address into three distinct Series [Part 1]

I'm experimenting/learning Python with a data set containing customers information.

The DataFrame structure is the following (these are made up records):

import pandas as pd
import numpy as np

df = pd.DataFrame({'cus_name' : ['James', 'Mary', 'David', 'Linda', 'George', 'Jennifer', 'John', 'Maria', 'Charles', 'Helen'],
                   'address' : ['Main St 59', 'Yellow Av 11 b.F1', 'Terrazzo Way 100-102', np.nan, 'Hamilton St 159 b.A/B', np.nan, 'Henry St 7 D', 'Mc-Kenzie Av 40P b.1', 'Neptune Av 14 15 b.G', np.nan ], 
                   'postal_code' : [1410, 1210, 1020, np.nan, 1310, np.nan, 1080, 1190, 1040, np.nan], 
                  })

print(df)

   cus_name                address  postal_code
0     James             Main St 59       1410.0
1      Mary      Yellow Av 11 b.F1       1210.0
2     David   Terrazzo Way 100-102       1020.0
3     Linda                    NaN          NaN
4    George  Hamilton St 159 b.A/B       1310.0
5  Jennifer                    NaN          NaN
6      John           Henry St 7 D       1080.0
7     Maria   Mc-Kenzie Av 40P b.1       1190.0
8   Charles   Neptune Av 14 15 b.G       1040.0
9     Helen                    NaN          NaN

I'm particularly interested in the address Series. Specifically, my goal is to "split" the information of the street, number, and box into three distinct Series.

For instance, after the transformation, the first and seventh record/row should look like this:

| cus_name | street       | number | box | postal_code |
|----------|--------------|--------|-----|-------------|
| James    | Main St      | 59     | NaN | 1410        |
| Maria    | Mc-Kenzie Av | 40P    | 1   | 1190.0      |

At first, I had no idea how to tackle this problem. After doing some research here, I found some interesting related posts that use regular expressions.

Since I'm no expert in Python (nor regular expressions), I thought I could start by identifying the pattern in the address Series. In fact, each address has the following pattern:

I'm asking for help to achieve my desired goal using regular expressions (if regex is the solution).

Upvotes: 1

Views: 123

Answers (2)

RomanPerekhrest
RomanPerekhrest

Reputation: 92854

Another regex approach:

In [913]: df[['street', 'number', 'box']] = df.address.str.extract(r'(\D+)\s+(\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?', expand=True)

In [914]: df
Out[914]: 
   cus_name                address  postal_code        street   number  box
0     James             Main St 59       1410.0       Main St       59  NaN
1      Mary      Yellow Av 11 b.F1       1210.0     Yellow Av       11   F1
2     David   Terrazzo Way 100-102       1020.0  Terrazzo Way  100-102  NaN
3     Linda                    NaN          NaN           NaN      NaN  NaN
4    George  Hamilton St 159 b.A/B       1310.0   Hamilton St      159  A/B
5  Jennifer                    NaN          NaN           NaN      NaN  NaN
6      John           Henry St 7 D       1080.0      Henry St      7 D  NaN
7     Maria   Mc-Kenzie Av 40P b.1       1190.0  Mc-Kenzie Av      40P    1
8   Charles   Neptune Av 14 15 b.G       1040.0    Neptune Av    14 15    G
9     Helen                    NaN          NaN           NaN      NaN  NaN

Upvotes: 2

Quang Hoang
Quang Hoang

Reputation: 150785

You can try this pattern:

pattern = "^(?P<street>.+)\s+(?P<number>[\d\w]+)\s+b\.(?P<box>.*)$"
df['address'].str.extract(pattern)

output

          street number  box
0            NaN    NaN  NaN
1      Yellow Av     11   F1
2            NaN    NaN  NaN
3            NaN    NaN  NaN
4    Hamilton St    159  A/B
5            NaN    NaN  NaN
6            NaN    NaN  NaN
7   Mc-Kenzie Av    40P    1
8  Neptune Av 14     15    G
9            NaN    NaN  NaN

For an explanation, paste the pattern here.

If you require the street be strictly without numbers, e.g. line 8 above, use this pattern:

pattern = "^(?P<street>[\D]+)\s+(?P<number>[\w\s]+)\s+b\.(?P<box>.*)$"

which gives:

         street number  box
0           NaN    NaN  NaN
1     Yellow Av     11   F1
2           NaN    NaN  NaN
3           NaN    NaN  NaN
4   Hamilton St    159  A/B
5           NaN    NaN  NaN
6           NaN    NaN  NaN
7  Mc-Kenzie Av    40P    1
8    Neptune Av  14 15    G
9           NaN    NaN  NaN

Upvotes: 2

Related Questions