Reputation: 81
I need help from some regex gurus. I have been struggling with this one for a little while and can't get it working as intended.
This is my regex pattern at the moment. It gets everything between ',' and ')':
df['regex_2'] = df['name'].str.extract(r'\,(.*?)\)')
Text 123 (SDC, XUJ)
Text BCD (AUD)
Text 123 (AUD, XTJ)
Text BCSS (AUD,TACT,HGI7649AU)
` XUJ`
``
` XTJ`
`TACT,HGI7649AU`
However, what I need is all characters after the last comma before the bracket. Please see examples below.
Text 123 (SDC, XUJ)
Text BCD (AUD)
Text 123 (AUD, XTJ)
Text BCSS (AUD,TACT,HGI7649AU)
`XUJ`
``
`XTJ`
`HGI7649AU`
Upvotes: 1
Views: 62
Reputation: 257
The pattern used matches any character after the comma, including commas themselves:
r'\,(.*?)\)'
In the following test case this yields both tokens after the first comma, because `,` is itself a matching character:
Text BCSS (AUD,TACT,HGI7649AU) -> TACT,HGI7649AU
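To make that behaviour concrete, here is a minimal sketch using Python's `re` module directly (the same engine pandas uses for `str.extract`). The variable names are only for illustration.

```python
import re

text = "Text BCSS (AUD,TACT,HGI7649AU)"

# The search locks onto the FIRST comma; the lazy `.*?` then grows one
# character at a time (commas included) until it reaches the first `)`.
lazy = re.search(r'\,(.*?)\)', text).group(1)
print(lazy)  # TACT,HGI7649AU
```

The lazy quantifier only controls where the match ends, not where it starts, which is why the earlier commas are not skipped.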
One way to capture only the token after the last comma and before the parenthesis is to match on all characters excluding commas, using the syntax `[^,]`:
r',\s*([^,]*)\)'
- `\s*` is added to match zero or more whitespace characters so that they are not included in the capture group
- `[^,]` is interpreted as "any character except `,`"
- `?` is removed from the capture group: `[^,]*` cannot cross a comma, so lazy matching is no longer needed to stop at the last token

Example:
In: re.search(r',\s*([^,]*)\)', 'Text BCSS (AUD,TACT,HGI7649AU)').group(1)
Out: 'HGI7649AU'
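As a rough check against all four inputs from the question, the same pattern can be wrapped in a small helper. `last_token` is a hypothetical name, and returning an empty string when there is no comma is an assumption based on the expected output in the question.

```python
import re

# Pattern from this answer: a comma, optional whitespace, then a run of
# non-comma characters that must end right before a closing parenthesis.
PATTERN = re.compile(r',\s*([^,]*)\)')

def last_token(name: str) -> str:
    """Return the token after the last comma, or '' if there is no match."""
    match = PATTERN.search(name)
    return match.group(1) if match else ''

names = [
    "Text 123 (SDC, XUJ)",
    "Text BCD (AUD)",
    "Text 123 (AUD, XTJ)",
    "Text BCSS (AUD,TACT,HGI7649AU)",
]
results = [last_token(n) for n in names]
print(results)  # ['XUJ', '', 'XTJ', 'HGI7649AU']
```

With pandas, the equivalent would be passing the same pattern to `df['name'].str.extract(...)` and filling missing values with `''`.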
Upvotes: 1
Reputation: 627536
In case you prefer using `Series.str.replace`, you can use
import pandas as pd
df = pd.DataFrame({'name': ["Text 123 (SDC, XUJ)", "Text BCD (AUD)", "Text 123 (AUD, XTJ)", "Text BCSS (AUD,TACT,HGI7649AU)"]})
df['result'] = df['name'].str.replace(r'.*,\s*([^()]*).*|.+', r'\1', regex=True)
# => df
# name result
# 0 Text 123 (SDC, XUJ) XUJ
# 1 Text BCD (AUD)
# 2 Text 123 (AUD, XTJ) XTJ
# 3 Text BCSS (AUD,TACT,HGI7649AU) HGI7649AU
See the regex demo. Details:
- `.*,\s*([^()]*).*` - any zero or more chars other than line break chars, as many as possible, then a comma, then zero or more whitespaces, then Group 1 capturing zero or more chars other than `(` and `)`, and then the rest of the line
- `|` - or
- `.+` - one or more chars other than line break chars, as many as possible.

The replacement is `\1`, the value of Group 1.
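The two branches can be tried out without pandas using `re.sub`, which should behave the same way here. This is only a sketch: `after_last_comma` is a hypothetical helper, and the last call shows a caveat that the question's inputs do not hit.

```python
import re

PATTERN = r'.*,\s*([^()]*).*|.+'

def after_last_comma(name: str) -> str:
    # The whole string is replaced by Group 1. When only the `.+` branch
    # matches, Group 1 did not participate and expands to '' (Python 3.5+).
    return re.sub(PATTERN, r'\1', name)

names = [
    "Text 123 (SDC, XUJ)",
    "Text BCD (AUD)",
    "Text 123 (AUD, XTJ)",
    "Text BCSS (AUD,TACT,HGI7649AU)",
]
results = [after_last_comma(n) for n in names]
print(results)  # ['XUJ', '', 'XTJ', 'HGI7649AU']

# Caveat: the pattern looks for the last comma anywhere in the string,
# so a comma outside the parentheses is picked up as well.
outside = after_last_comma("Text, 123 (AUD)")
print(repr(outside))  # '123 '
```

If commas can appear outside the parentheses in your data, a pattern anchored on `\(` is probably the safer choice.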
Upvotes: 0
Reputation: 163632
To get the value "after the last comma before the bracket." you can use a capture group:
\([^()]*,([^()]+)\)
In parts, the pattern matches

- `\(` Match the opening parenthesis
- `[^()]*,` Match any char except `(` and `)` and then match a comma
- `(` Capture group 1
  - `[^()]+` Match 1+ times any char except `(` and `)`
- `)` Close group 1
- `\)` Match the closing parenthesis

Example
import pandas as pd

strings = [
"Text 123 (SDC, XUJ)",
"Text BCD (AUD)",
"Text 123 (AUD, XTJ)",
"Text BCSS (AUD,TACT,HGI7649AU)",
]
df = pd.DataFrame(strings, columns=["name"])
df['regex_2'] = df['name'].str.extract(r'\([^()]*,([^()]+)\)')
df = df.fillna('')
print(df)
Output
name regex_2
0 Text 123 (SDC, XUJ) XUJ
1 Text BCD (AUD)
2 Text 123 (AUD, XTJ) XTJ
3 Text BCSS (AUD,TACT,HGI7649AU) HGI7649AU
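One detail the printed frame can hide, since the column is right-aligned: for inputs with a space after the comma, Group 1 as written includes that leading space. A possible tweak, offered only as a sketch, is to consume whitespace before the group with `\s*`. The names `original` and `trimmed` are for illustration only.

```python
import re

original = re.compile(r'\([^()]*,([^()]+)\)')
# Hypothetical variant: `\s*` eats whitespace after the last comma
# so that it does not end up inside the capture group.
trimmed = re.compile(r'\([^()]*,\s*([^()]+)\)')

text = "Text 123 (SDC, XUJ)"
with_space = original.search(text).group(1)
without_space = trimmed.search(text).group(1)
print(repr(with_space))     # ' XUJ'
print(repr(without_space))  # 'XUJ'
```

Alternatively, the extracted column could be cleaned afterwards with `.str.strip()`.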
Upvotes: 0