Reputation: 679
I'm experimenting/learning Python with a data set containing customers information.
The DataFrame structure is the following (these are made up records):
import pandas as pd
import numpy as np
df = pd.DataFrame({'cus_name' : ['James', 'Mary', 'David', 'Linda', 'George', 'Jennifer', 'John', 'Maria', 'Charles', 'Helen'],
'address' : ['Main St 59', 'Yellow Av 11 b.F1', 'Terrazzo Way 100-102', np.nan, 'Hamilton St 159 b.A/B', np.nan, 'Henry St 7 D', 'Mc-Kenzie Av 40P b.1', 'Neptune Av 14 15 b.G', np.nan ],
'postal_code' : [1410, 1210, 1020, np.nan, 1310, np.nan, 1080, 1190, 1040, np.nan],
})
print(df)
cus_name address postal_code
0 James Main St 59 1410.0
1 Mary Yellow Av 11 b.F1 1210.0
2 David Terrazzo Way 100-102 1020.0
3 Linda NaN NaN
4 George Hamilton St 159 b.A/B 1310.0
5 Jennifer NaN NaN
6 John Henry St 7 D 1080.0
7 Maria Mc-Kenzie Av 40P b.1 1190.0
8 Charles Neptune Av 14 15 b.G 1040.0
9 Helen NaN NaN
I'm particularly interested in the address
Series. Specifically, my goal is to "split" the information of the street, number, and box into three distinct Series.
For instance, after the transformation, the first and seventh record/row should look like this:
| cus_name | street | number | box | postal_code |
|----------|--------------|--------|-----|-------------|
| James | Main St | 59 | NaN | 1410 |
| Maria | Mc-Kenzie Av | 40P | 1 | 1190.0 |
At first, I had no idea how to tackle this problem. After doing some research here, I found some interesting related posts that use regular expressions.
Since I'm no expert in Python (nor regular expressions), I thought I could start by identifying the pattern in the address
Series. In fact, each address has the following pattern:
The street part which is located at the beginning of the string. It is composed of one or more words separated by a white-space character or a dash (e.g. Mc-Kenzie Av
);
The number part which is located in the middle of the string. It is composed of one or more alpha-numeric words separated by a white-space character or a dash (e.g. 100-102
, 7 D
);
The box part which is located at the end of the string. It always immediately follows the b.
characters and is composed of one word containing alpha-numeric characters and possibly some special characters (e.g. A/B
, F1
).
I'm asking for help to achieve my desired goal using regular expressions (if regex is the solution).
Upvotes: 1
Views: 123
Reputation: 92854
Another regex approach:
In [913]: df[['street', 'number', 'box']] = df.address.str.extract(r'(\D+)\s+(\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?', expand=True)
In [914]: df
Out[914]:
cus_name address postal_code street number box
0 James Main St 59 1410.0 Main St 59 NaN
1 Mary Yellow Av 11 b.F1 1210.0 Yellow Av 11 F1
2 David Terrazzo Way 100-102 1020.0 Terrazzo Way 100-102 NaN
3 Linda NaN NaN NaN NaN NaN
4 George Hamilton St 159 b.A/B 1310.0 Hamilton St 159 A/B
5 Jennifer NaN NaN NaN NaN NaN
6 John Henry St 7 D 1080.0 Henry St 7 D NaN
7 Maria Mc-Kenzie Av 40P b.1 1190.0 Mc-Kenzie Av 40P 1
8 Charles Neptune Av 14 15 b.G 1040.0 Neptune Av 14 15 G
9 Helen NaN NaN NaN NaN NaN
Upvotes: 2
Reputation: 150785
You can try this pattern:
pattern = "^(?P<street>.+)\s+(?P<number>[\d\w]+)\s+b\.(?P<box>.*)$"
df['address'].str.extract(pattern)
output
street number box
0 NaN NaN NaN
1 Yellow Av 11 F1
2 NaN NaN NaN
3 NaN NaN NaN
4 Hamilton St 159 A/B
5 NaN NaN NaN
6 NaN NaN NaN
7 Mc-Kenzie Av 40P 1
8 Neptune Av 14 15 G
9 NaN NaN NaN
For an explanation, paste the pattern here.
If you require the street be strictly without numbers, e.g. line 8 above, use this pattern:
pattern = "^(?P<street>[\D]+)\s+(?P<number>[\w\s]+)\s+b\.(?P<box>.*)$"
which gives:
street number box
0 NaN NaN NaN
1 Yellow Av 11 F1
2 NaN NaN NaN
3 NaN NaN NaN
4 Hamilton St 159 A/B
5 NaN NaN NaN
6 NaN NaN NaN
7 Mc-Kenzie Av 40P 1
8 Neptune Av 14 15 G
9 NaN NaN NaN
Upvotes: 2