Reputation: 81

Regex Expression help - python

I need help from some regex gurus. I have been struggling with this one for a little while and can't get it working as intended.

This is my regex pattern at the moment. It gets everything between the first ',' and ')':

df['regex_2'] = df['name'].str.extract(r'\,(.*?)\)')

Text 123 (SDC, XUJ)
Text BCD (AUD)
Text 123 (AUD, XTJ)
Text BCSS (AUD,TACT,HGI7649AU)


` XUJ`
``
` XTJ`
`TACT,HGI7649AU`

However, what I need is all characters after the last comma before the bracket. Please see examples below.

Text 123 (SDC, XUJ)
Text BCD (AUD)
Text 123 (AUD, XTJ)
Text BCSS (AUD,TACT,HGI7649AU)


`XUJ`
`` 
`XTJ`
`HGI7649AU`

Upvotes: 1

Answers (3)

Andre Nasri

Reputation: 257

The pattern used matches any character after the comma, including commas themselves:

r'\,(.*?)\)'

In the following test case this yields both tokens after the first comma, because a comma is itself one of the characters that . matches:

Text BCSS (AUD,TACT,HGI7649AU) -> TACT,HGI7649AU

One way to achieve the goal of only capturing the token after the last comma and before the parenthesis is to instead match on all characters excluding commas by using the syntax [^,]:

r',\s*([^,]*)\)'

\s* is added to match zero or more whitespace characters so that they are not included in the capture group
[^,] is interpreted as "any character except ,"
? is removed from the capture group: it only made .* lazy, which is unnecessary now that [^,]* cannot run past a comma

Example:

In: re.search(r',\s*([^,]*)\)', 'Text BCSS (AUD,TACT,HGI7649AU)').group(1)

Out: 'HGI7649AU'
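
One caveat when running this over all four sample rows: re.search returns None for a string with no comma (such as Text BCD (AUD)), so calling .group(1) directly would raise an AttributeError. Below is a minimal sketch, using only the standard re module, that guards against that case. The helper name last_token is hypothetical and not part of the original answer.

```python
import re

# Pattern from the answer above: a comma, optional whitespace, then a run of
# non-comma characters that ends right at the closing parenthesis.
PATTERN = re.compile(r',\s*([^,]*)\)')

names = [
    "Text 123 (SDC, XUJ)",
    "Text BCD (AUD)",
    "Text 123 (AUD, XTJ)",
    "Text BCSS (AUD,TACT,HGI7649AU)",
]

def last_token(text):
    """Hypothetical helper: text after the last comma in the brackets, or ''."""
    match = PATTERN.search(text)
    # search() returns None when the string has no comma, so guard before .group()
    return match.group(1) if match else ""

results = [last_token(name) for name in names]
print(results)  # ['XUJ', '', 'XTJ', 'HGI7649AU']
```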

Upvotes: 1

Wiktor Stribiżew

Reputation: 627536

In case you prefer using Series.str.replace, you can use

import pandas as pd
df = pd.DataFrame({'name': ["Text 123 (SDC, XUJ)", "Text BCD (AUD)", "Text 123 (AUD, XTJ)", "Text BCSS (AUD,TACT,HGI7649AU)"]})
df['result'] = df['name'].str.replace(r'.*,\s*([^()]*).*|.+', r'\1', regex=True)
# => df
#                              name     result
# 0             Text 123 (SDC, XUJ)        XUJ
# 1                  Text BCD (AUD)           
# 2             Text 123 (AUD, XTJ)        XTJ
# 3  Text BCSS (AUD,TACT,HGI7649AU)  HGI7649AU

Details:

.*,\s*([^()]*).* - any zero or more chars other than line break chars, as many as possible, then a comma, then zero or more whitespaces, then Group 1 capturing zero or more chars other than ( and ), and then the rest of the line
| - or
.+ - one or more chars other than line break chars, as many as possible.

The replacement is \1, the value of Group 1.
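
To see why the |.+ alternative matters, here is a small sketch with plain re.sub (assuming it behaves like Series.str.replace with regex=True for these inputs). Without the fallback, a row with no comma is not matched at all and is returned unchanged. With it, the whole row is consumed and replaced by the empty Group 1. The variable names are only for illustration.

```python
import re

names = [
    "Text 123 (SDC, XUJ)",
    "Text BCD (AUD)",
    "Text 123 (AUD, XTJ)",
    "Text BCSS (AUD,TACT,HGI7649AU)",
]

with_fallback = r'.*,\s*([^()]*).*|.+'   # pattern from the answer
without_fallback = r'.*,\s*([^()]*).*'   # same pattern minus the |.+ branch

# With the fallback, rows without a comma collapse to an empty string
# (an unmatched group is substituted as '' on Python 3.5+).
replaced = [re.sub(with_fallback, r'\1', n) for n in names]
print(replaced)    # ['XUJ', '', 'XTJ', 'HGI7649AU']

# Without it, the comma-less row does not match and stays untouched.
unreplaced = [re.sub(without_fallback, r'\1', n) for n in names]
print(unreplaced)  # ['XUJ', 'Text BCD (AUD)', 'XTJ', 'HGI7649AU']
```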

Upvotes: 0

The fourth bird

Reputation: 163632

To get the value "after the last comma before the bracket", you can use a capture group:

\([^()]*,([^()]+)\)

In parts, the pattern matches

\( Match the opening parenthesis
[^()]*, Match any char except ( and ) and then match a comma
( Capture group 1
- [^()]+ Match 1+ times any char except ( and )
) Close group 1
\) Match the closing parenthesis


Example

import pandas as pd

strings = [
    "Text 123 (SDC, XUJ)",
    "Text BCD (AUD)",
    "Text 123 (AUD, XTJ)",
    "Text BCSS (AUD,TACT,HGI7649AU)",
]

df = pd.DataFrame(strings, columns=["name"])
df['regex_2'] = df['name'].str.extract(r'\([^()]*,([^()]+)\)')
df = df.fillna('')
print(df)

Output

                             name    regex_2
0             Text 123 (SDC, XUJ)        XUJ
1                  Text BCD (AUD)           
2             Text 123 (AUD, XTJ)        XTJ
3  Text BCSS (AUD,TACT,HGI7649AU)  HGI7649AU
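
One thing the right-aligned printout hides: for rows with a space after the comma, the group captured by this pattern keeps that leading space. A quick check with the standard re module shows it. The variant with \s* after the comma is my own assumption about how one might trim it, not part of the answer above. Calling .str.strip() on the resulting column would be another option.

```python
import re

original = r'\([^()]*,([^()]+)\)'      # pattern from the answer
trimmed = r'\([^()]*,\s*([^()]+)\)'    # assumed variant: \s* eats the whitespace

text = "Text 123 (SDC, XUJ)"

captured_original = re.search(original, text).group(1)
captured_trimmed = re.search(trimmed, text).group(1)

print(repr(captured_original))  # ' XUJ'  (note the leading space)
print(repr(captured_trimmed))   # 'XUJ'
```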

Upvotes: 0
