Reputation: 197
I have a pandas DataFrame column called 'Raw' whose format is inconsistent. The strings it contains look like this:
'(1T XXX, Europe)'
'(2T YYYY, Latin America)'
'(3T ZZ/ZZZZ, Europe)'
'(4T XXX XXX, Africa)'
The only consistent features of the strings in 'Raw' are that they start with a digit (after the opening parenthesis), contain a comma in the middle followed by a whitespace, and are wrapped in parentheses.
Now, I'd like to create two extra columns (Model and Region) in my dataframe.
How do I do that using regex?
Upvotes: 3
Views: 10005
Reputation: 8816
You can try the approaches below:
df
raw
0 (1T XXX, Europe)
1 (2T YYYY, Latin America)
2 (3T ZZ/ZZZZ, Europe)
3 (4T XXX XXX, Africa)
Using str.extract with a regex:
df = df.raw.str.extract(r'\((.*), (.*)\)').rename(columns={0:'Model', 1:'Region'})
print(df)
Model Region
0 1T XXX Europe
1 2T YYYY Latin America
2 3T ZZ/ZZZZ Europe
3 4T XXX XXX Africa
Using str.replace() + str.split() with rename:
df = df.raw.str.replace(r'[()]', '', regex=True).str.split(', ', expand=True).rename(columns={0:'Model', 1:'Region'})
print(df)
Model Region
0 1T XXX Europe
1 2T YYYY Latin America
2 3T ZZ/ZZZZ Europe
3 4T XXX XXX Africa
However, if you want to retain the original column as well, you can use the method below:
df[['Model', 'Region']] = df.raw.str.replace(r'[()]', '', regex=True).str.split(', ', expand=True)
print(df)
raw Model Region
0 (1T XXX, Europe) 1T XXX Europe
1 (2T YYYY, Latin America) 2T YYYY Latin America
2 (3T ZZ/ZZZZ, Europe) 3T ZZ/ZZZZ Europe
3 (4T XXX XXX, Africa) 4T XXX XXX Africa
OR
df[['Model', 'Region' ]] = df.raw.str.extract(r'\((.*), (.*)\)')
print(df)
raw Model Region
0 (1T XXX, Europe) 1T XXX Europe
1 (2T YYYY, Latin America) 2T YYYY Latin America
2 (3T ZZ/ZZZZ, Europe) 3T ZZ/ZZZZ Europe
3 (4T XXX XXX, Africa) 4T XXX XXX Africa
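As a variation on the str.extract approach above, named groups let the regex itself supply the column names, so no rename is needed. This is only a minimal sketch: the DataFrame is rebuilt from the question's four sample strings, and the names pattern, parts and result are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question; the column name 'raw' follows this answer.
df = pd.DataFrame({'raw': ['(1T XXX, Europe)',
                           '(2T YYYY, Latin America)',
                           '(3T ZZ/ZZZZ, Europe)',
                           '(4T XXX XXX, Africa)']})

# Named groups (?P<name>...) become the column names of the extracted frame.
pattern = r'\((?P<Model>[^,]*), (?P<Region>[^)]*)\)'
parts = df['raw'].str.extract(pattern)

# join keeps the original column next to the two new ones.
result = df.join(parts)
print(result)
```

Rows that do not fit the pattern should come out as NaN in both new columns rather than raising an error.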
Upvotes: 0
Reputation: 165
If the comma is a reliable separator of your string parts, then you do not need regexp. If df is your dataframe:
df['Model'] = [x.split(', ')[0].replace('(', '') for x in df['Raw']]
df['Region'] = [x.split(', ')[1].replace(')', '') for x in df['Raw']]
If you want to use regexp, it would look something like:
import re

s = '(1T XXX, Europe)'
# [^,]+ and [^)]+ also accept characters such as '/' that [\w\s]+ would reject
m = re.match(r'\(([^,]+),\s*([^)]+)\)', s)
model = m.group(1)
region = m.group(2)
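One thing to watch with the plain re approach: re.match returns None when a string does not fit the pattern, so calling m.group(...) on it raises an AttributeError. A rough sketch of a guard is below. The helper name split_raw and the constant PATTERN are hypothetical, and the pattern assumes neither the model nor the region contains a comma or a closing parenthesis.

```python
import re

# Compiled once and reused for every string.
PATTERN = re.compile(r'\(([^,]+),\s*([^)]+)\)')

def split_raw(s):
    """Hypothetical helper: return (model, region), or (None, None) if s does not fit."""
    m = PATTERN.fullmatch(s)
    if m is None:
        return None, None
    return m.group(1), m.group(2)

samples = ['(1T XXX, Europe)',
           '(2T YYYY, Latin America)',
           '(3T ZZ/ZZZZ, Europe)',
           '(4T XXX XXX, Africa)']

pairs = [split_raw(s) for s in samples]
print(pairs)
```

The resulting pairs can then be assigned to the two new columns, for example by unzipping them with zip(*pairs).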
Upvotes: 0
Reputation: 602
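import pandas as pd

string_list = ['(1T XXX, Europe)',
'(2T YYYY, Latin America)',
'(3T ZZ/ZZZZ, Europe)',
'(4T XXX XXX, Africa)']
df = pd.DataFrame(string_list)
# Two capture groups give a two-column DataFrame: column 0 is the model, column 1 the region
df = df[0].str.extract(r"\(([^,]*), ([^)]*)\)", expand=False)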
Upvotes: 0
Reputation: 27
import re
s = '(1T XXX, Europe)'
Model = re.findall(r"(?<=\().+(?=\,)", s)
Region = re.findall(r"(?<=\, ).+(?=\))", s)
The first regex uses a lookbehind for the opening bracket "(" and a lookahead for the "," so it matches only the model between them. The second regex matches whatever sits between ", " and the closing ")".
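A small caveat, sketched below with a made-up string: .+ is greedy, so if a value ever contained a second comma (the question's data does not), the model match would run to the last comma. Making the quantifier lazy with .+? stops at the first one instead. The string s and the names greedy and lazy are assumptions for illustration only.

```python
import re

# Hypothetical string with two commas; the question's data has only one.
s = '(1T XXX, YYY, Europe)'

# Greedy: .+ takes as much as it can, so the match ends at the last comma.
greedy = re.findall(r"(?<=\().+(?=,)", s)

# Lazy: .+? takes as little as it can, so the match ends at the first comma.
lazy = re.findall(r"(?<=\().+?(?=,)", s)

print(greedy)
print(lazy)
```

With exactly one comma per string, as in the question, both versions give the same result.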
Upvotes: 0
Reputation: 3130
Since there's only one comma, and everything is between parentheses, in your case, use .str.split() instead, after slicing appropriately:
model_region = df.Raw.str[1:-1].str.split(', ', expand = True)
But if you insist:
model_region = df.Raw.str.extract(r'\((.*), (.*)\)', expand = True)
Then
df['Model'] = model_region[0]
df['Region'] = model_region[1]
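Put together, the slicing-and-splitting route might look like the sketch below. The DataFrame is rebuilt from the question's sample strings, and passing n=1 is an extra precaution I am assuming you might want: it limits the split to the first ', ' in case a region ever contains one.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'Raw': ['(1T XXX, Europe)',
                           '(2T YYYY, Latin America)',
                           '(3T ZZ/ZZZZ, Europe)',
                           '(4T XXX XXX, Africa)']})

# Drop the surrounding parentheses, then split once on ', '.
model_region = df['Raw'].str[1:-1].str.split(', ', n=1, expand=True)

# Name the two pieces and attach them to the original frame.
model_region.columns = ['Model', 'Region']
df = df.join(model_region)
print(df)
```

No regex is involved here, which tends to be easier to read when the separator is as regular as it is in this data.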
Upvotes: 5
Reputation: 1420
import re
s = '(3T ZZ/ZZZZ, Europe)'
m = re.search(r'\((.*), (.*)\)', s)
print(m.groups())
Upvotes: 0
Reputation: 1815
Try this: \(([^,]*), ([^)]*)\)
See: https://regex101.com/r/fCetWg/1
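To read that pattern piece by piece, here is the same regex spelled out with re.VERBOSE, as a sketch only. In verbose mode plain whitespace in the pattern is ignored, so the space after the comma is written as [ ]. The variable names are made up for illustration.

```python
import re

pattern = re.compile(r"""
    \(          # literal opening parenthesis
    ([^,]*)     # group 1: everything up to the first comma (the model)
    ,[ ]        # the comma and the single space after it
    ([^)]*)     # group 2: everything up to the closing parenthesis (the region)
    \)          # literal closing parenthesis
""", re.VERBOSE)

match = pattern.search('(4T XXX XXX, Africa)')
model, region = match.groups()
print(model, region)
```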
Upvotes: 1