Reputation: 661
Below is a subset of a pandas DataFrame with a column like the one below:
No Name
0 1 SOU 01 Sungai Dingin
1 2 PKS 2
2 3 Mill 3
3 4 Tanah Kerajaan Mill
4 5 MAS POM
5 6 SOU 20 Chaah
6 7 SOU 03 Elphil Mill
7 8 SOU 08 East Mill
8 9 SOU 04 Flemington POM
9 10 SOU 30A Jeleta Bumi
10 11 SOU 30B Mostyn
11 12 KLK - Mill 02
12 13 Chini 02 POM
13 14 SOU 05 Selaba POM
14 15 SOU 9A Sepang Mill
I am trying to figure out the best way to use a regex in Python to remove just the 'SOU XX' or 'SOU XXX' prefix (letters followed by a short alphanumeric code) from that column, without affecting the other text in the column.
The output would be something like the below:
No Name
0 1 Sungai Dingin
1 2 PKS 2
2 3 Mill 3
3 4 Tanah Kerajaan Mill
4 5 MAS POM
5 6 Chaah
6 7 Elphil Mill
7 8 East Mill
8 9 Flemington POM
9 10 Jeleta Bumi
10 11 Mostyn
11 12 KLK - Mill 02
12 13 Chini 02 POM
13 14 Selaba POM
14 15 Sepang Mill
Upvotes: 0
Views: 314
Reputation: 43126
You can use the regex '^SOU \S{2,3} ' (note the trailing space at the end) with str.replace:
df['Name'] = df['Name'].str.replace(r'^SOU \S{2,3} ', '', regex=True)  # regex=True is required on pandas 2.0+, where the default is literal matching
Result:
No Name
0 1 Sungai Dingin
1 2 PKS 2
2 3 Mill 3
3 4 Tanah Kerajaan Mill
4 5 MAS POM
5 6 Chaah
6 7 Elphil Mill
7 8 East Mill
8 9 Flemington POM
9 10 Jeleta Bumi
10 11 Mostyn
11 12 KLK - Mill 02
12 13 Chini 02 POM
13 14 Selaba POM
14 15 Sepang Mill
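To reproduce the result, here is a minimal self-contained sketch. It rebuilds only a handful of rows from the question's data (the subset is an arbitrary choice for illustration) and passes regex=True explicitly, on the assumption that a recent pandas version is installed:

```python
import pandas as pd

# A few rows copied from the question; the full frame has 15 rows
df = pd.DataFrame({
    "No": [1, 2, 6, 10, 15],
    "Name": [
        "SOU 01 Sungai Dingin",
        "PKS 2",
        "SOU 20 Chaah",
        "SOU 30A Jeleta Bumi",
        "SOU 9A Sepang Mill",
    ],
})

# Strip a leading "SOU <2-3 non-space chars> " prefix; other rows are left alone
df["Name"] = df["Name"].str.replace(r"^SOU \S{2,3} ", "", regex=True)

print(df)
```

Rows that do not start with "SOU", such as "PKS 2", pass through unchanged because the pattern is anchored at the start of the string.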
The regex '^SOU \S{2,3} ' matches the letters "SOU", a space, any two or three non-whitespace characters (\S), and a trailing space, but only if they appear at the start of the string, thanks to the ^ anchor.
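The pattern can also be checked outside pandas with Python's built-in re module. In the sketch below, strip_sou is a hypothetical helper name, and the last two inputs are made-up strings (not from the question) chosen to show where the pattern deliberately does not match:

```python
import re

# Same pattern as in the answer: anchor, "SOU", space, 2-3 non-space chars, space
SOU_PREFIX = re.compile(r"^SOU \S{2,3} ")

def strip_sou(name: str) -> str:
    """Remove a leading 'SOU XX ' or 'SOU XXX ' prefix, if present."""
    return SOU_PREFIX.sub("", name)

# Values from the question
print(strip_sou("SOU 30A Jeleta Bumi"))  # three-character code
print(strip_sou("SOU 9A Sepang Mill"))   # two-character code
print(strip_sou("KLK - Mill 02"))        # no SOU prefix, left as-is

# Made-up inputs for illustration
print(strip_sou("Mill SOU 01 East"))     # "SOU" not at the start, left as-is
print(strip_sou("SOU 1 East"))           # one-character code, left as-is
```

Because \S{2,3} requires at least two non-whitespace characters, a single-character code would not be removed; widening the quantifier (for example to \S+) would be one way to handle that, at the cost of matching longer tokens too.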
Upvotes: 2