Joss

Reputation: 197

Python using regex to extract parts of a string in pandas column

I've got a pandas df column called 'Raw' whose format is inconsistent. The strings it contains look like this:

'(1T XXX, Europe)'
'(2T YYYY, Latin America)'
'(3T ZZ/ZZZZ, Europe)'
'(4T XXX XXX, Africa)'

The only consistent features of the strings in 'Raw' are that they are wrapped in parentheses, their content starts with a digit, and they include a comma in the middle followed by a whitespace.

Now, I'd like to create two extra columns (Model and Region) in my dataframe:

How do I do that using regex?

Upvotes: 3

Views: 10005

Answers (7)

Karn Kumar

Reputation: 8816

You can try the following:

Sample DataFrame:

df
                        raw
0          (1T XXX, Europe)
1  (2T YYYY, Latin America)
2      (3T ZZ/ZZZZ, Europe)
3      (4T XXX XXX, Africa)

Solution 1:

using str.extract with regex.

df = df.raw.str.extract(r'\((.*), (.*)\)').rename(columns={0:'Model', 1:'Region'})
print(df)
        Model         Region
0      1T XXX         Europe
1     2T YYYY  Latin America
2  3T ZZ/ZZZZ         Europe
3  4T XXX XXX         Africa

Solution 2:

str.replace() + str.split() with rename.

df = df.raw.str.replace(r'[()]', '', regex=True).str.split(', ', expand=True).rename(columns={0:'Model', 1:'Region'})
print(df)
        Model         Region
0      1T XXX         Europe
1     2T YYYY  Latin America
2  3T ZZ/ZZZZ         Europe
3  4T XXX XXX         Africa

Note:

However, if you also want to keep the original column, you can use the method below:

df[['Model', 'Region']] = df.raw.str.replace(r'[()]', '', regex=True).str.split(', ', expand=True)

print(df)
                        raw       Model         Region
0          (1T XXX, Europe)      1T XXX         Europe
1  (2T YYYY, Latin America)     2T YYYY  Latin America
2      (3T ZZ/ZZZZ, Europe)  3T ZZ/ZZZZ         Europe
3      (4T XXX XXX, Africa)  4T XXX XXX         Africa

OR

df[['Model', 'Region' ]] = df.raw.str.extract(r'\((.*), (.*)\)')
print(df)
                        raw       Model         Region
0          (1T XXX, Europe)      1T XXX         Europe
1  (2T YYYY, Latin America)     2T YYYY  Latin America
2      (3T ZZ/ZZZZ, Europe)  3T ZZ/ZZZZ         Europe
3      (4T XXX XXX, Africa)  4T XXX XXX         Africa

Upvotes: 0

felix the cat

Reputation: 165

If the comma is a reliable separator of your string parts, then you do not need regexp. If df is your dataframe:

df['Model'] = [x.split(', ')[0].replace('(', '') for x in df['Raw']]
df['Region'] = [x.split(', ')[1].replace(')', '') for x in df['Raw']]

If you want to use a regexp, it would look something like:

import re

s = '(1T XXX, Europe)'
# [^,]+ and [^)]+ also accept characters such as '/' (e.g. 'ZZ/ZZZZ')
m = re.match(r'\(([^,]+), ([^)]+)\)', s)
model = m.group(1)
region = m.group(2)

Upvotes: 0

Akshay Kandul

Reputation: 602

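import pandas as pd

string_list = ['(1T XXX, Europe)',
'(2T YYYY, Latin America)',
'(3T ZZ/ZZZZ, Europe)',
'(4T XXX XXX, Africa)']
df = pd.DataFrame(string_list)
# Column 0 holds the raw strings; the two capture groups become columns 0 and 1
df = df[0].str.extract(r"\(([^,]*), ([^)]*)\)", expand=False)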

Upvotes: 0

Sudarshan shenoy

Reputation: 27

import re
s = '(1T XXX, Europe)'
Model = re.findall(r"(?<=\().+(?=,)", s)
Region = re.findall(r"(?<=, ).+(?=\))", s)

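The first regex uses a lookbehind for the opening bracket "(" in front of the model and a lookahead for the "," that follows it. The second regex matches any string between ", " and ")". Note that re.findall returns a list of matches, not a single string.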
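As a rough sketch of how these two lookaround patterns behave on the four sample strings from the question (the names samples, models and regions are made up for this illustration, and it assumes every string matches, so taking element [0] is safe):

```python
import re

samples = [
    '(1T XXX, Europe)',
    '(2T YYYY, Latin America)',
    '(3T ZZ/ZZZZ, Europe)',
    '(4T XXX XXX, Africa)',
]

models, regions = [], []
for s in samples:
    # (?<=\() asserts "(" right before the match; (?=,) asserts "," right after it.
    models.append(re.findall(r"(?<=\().+(?=,)", s)[0])
    # (?<=, ) asserts ", " right before the match; (?=\)) asserts ")" right after it.
    regions.append(re.findall(r"(?<=, ).+(?=\))", s)[0])

print(models)
print(regions)
```

The lookarounds themselves are not part of the match, which is why the parentheses and the comma do not show up in the results.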

Upvotes: 0

Ken Wei

Reputation: 3130

Since there's only one comma, and everything is between parentheses, in your case, use .str.split() instead, after slicing appropriately:

model_region = df.Raw.str[1:-1].str.split(', ', expand = True)

But if you insist:

model_region = df.Raw.str.extract(r'\((.*), (.*)\)', expand = True)

Then

df['Model'] = model_region[0]
df['Region'] = model_region[1]
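For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample column from the question and compares the two approaches. The DataFrame construction and the names by_split and by_regex are assumptions added for illustration only:

```python
import pandas as pd

# Sample data taken from the question; the column name 'Raw' follows the question.
df = pd.DataFrame({'Raw': [
    '(1T XXX, Europe)',
    '(2T YYYY, Latin America)',
    '(3T ZZ/ZZZZ, Europe)',
    '(4T XXX XXX, Africa)',
]})

# Approach 1: drop the surrounding parentheses by slicing, then split on ', '.
by_split = df['Raw'].str[1:-1].str.split(', ', expand=True)

# Approach 2: capture the two parts with a regex.
by_regex = df['Raw'].str.extract(r'\((.*), (.*)\)', expand=True)

# Both approaches should give the same values for this data.
same = (by_split[0].tolist() == by_regex[0].tolist()
        and by_split[1].tolist() == by_regex[1].tolist())
print(same)

df['Model'] = by_split[0]
df['Region'] = by_split[1]
print(df)
```

The slice-and-split version relies on there being exactly one ', ' per string; if a model name could itself contain ', ', the two approaches would no longer agree.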

Upvotes: 5

K. Kirsz

Reputation: 1420

import re

s = '(3T ZZ/ZZZZ, Europe)'
m = re.search(r'\((.*), (.*)\)', s)
print(m.groups())

Upvotes: 0

Esteban

Reputation: 1815

Try this : \(([^,]*), ([^)]*)\)

See : https://regex101.com/r/fCetWg/1
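One possible way to apply this pattern to the dataframe, sketched under the assumption that the column is called 'Raw' as in the question: giving the two groups names (Model and Region here) makes str.extract use them as column names, so no rename is needed. The DataFrame below is rebuilt from the question's sample strings for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Raw': [
    '(1T XXX, Europe)',
    '(2T YYYY, Latin America)',
    '(3T ZZ/ZZZZ, Europe)',
    '(4T XXX XXX, Africa)',
]})

# Same pattern as above, with the two groups turned into named groups.
pattern = r'\((?P<Model>[^,]*), (?P<Region>[^)]*)\)'

# str.extract returns one column per group; join keeps the original 'Raw' column.
df = df.join(df['Raw'].str.extract(pattern))
print(df)
```

Rows that do not match the pattern would come back as NaN in both new columns, which can be a convenient way to spot malformed entries.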

Upvotes: 1
