Dave D.

Reputation: 767

How to avoid special characters using Python regex?

I want to extract variables from the following string (i.e. names surrounded by ' ')

Case1:

string = r"""RESPONSE(1, -2.532 + 0.779*(LN('Loss_Ratio')) +SELECT(INDEX_FIRST_TRUE('POL_Zero'="No"),2.261,0.0) +SELECT(INDEX_FIRST_TRUE('POL_children'="Si"),0.307,0.0))"""

when I apply

all_variables = list(set(re.findall("'([^']*)'", string)))

I get the correct result:

all_variables = ['Loss_Ratio','POL_Zero','POL_children']

But in Case 2 (where the POL_Zero modality has changed):

string = r"""RESPONSE(1, -2.532 + 0.779*(LN('Loss_Ratio')) +SELECT(INDEX_FIRST_TRUE('POL_Zero'="Nos' conditional"),2.261,0.0) +SELECT(INDEX_FIRST_TRUE('POL_children'="Si"),0.307,0.0))"""

The same regex produces the wrong result. How can I still obtain the correct result in Case 2 as well?

Note that there can be no single or double quotes inside the names.

Upvotes: 1

Views: 79

Answers (1)

Wiktor Stribiżew

Reputation: 627537

You may leverage the fact that your single-quoted strings can contain neither single nor double quotes.

Only in this situation, the

"""'([^"']*)'"""

regex will work as expected. See the regex demo.

Here,

  • ' - matches a single quote
  • ([^"']*) - Group 1 (if you are using re.findall, only this part will be present in the output): zero or more (*) chars other than " and ' ([^"'])
  • ' - closing single quote.
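
To see what the extra " in the character class changes, here is a small side-by-side sketch on the Case 2 string from the question. The variable names case2, original and fixed are only illustrative.

```python
import re

# Case 2 string from the question; triple-quoted so both quote types can appear.
case2 = r"""RESPONSE(1, -2.532 + 0.779*(LN('Loss_Ratio')) +SELECT(INDEX_FIRST_TRUE('POL_Zero'="Nos' conditional"),2.261,0.0) +SELECT(INDEX_FIRST_TRUE('POL_children'="Si"),0.307,0.0))"""

# Pattern from the question: the stray ' after "Nos" pairs up with the
# opening quote of 'POL_children', so that name is never captured.
original = re.findall(r"'([^']*)'", case2)

# Pattern from this answer: a " between two single quotes makes the attempt
# fail, so the stray ' is skipped and 'POL_children' matches as a whole.
fixed = re.findall(r"""'([^"']*)'""", case2)

print(original)
print(fixed)
# => ['Loss_Ratio', 'POL_Zero', 'POL_children']
```

With the original pattern, the third item is a chunk of the expression that starts at the stray quote rather than a variable name.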

Python demo:

import re
s = """RESPONSE(1, -2.532 + 0.779*(LN('Loss_Ratio')) +SELECT(INDEX_FIRST_TRUE('POL_Zero'="No"),2.261,0.0) +SELECT(INDEX_FIRST_TRUE('POL_children'="Si"),0.307,0.0))

RESPONSE(1, -2.532 + 0.779*(LN('Loss_Ratio')) +SELECT(INDEX_FIRST_TRUE('POL_Zero'="Nos' conditional"),2.261,0.0) +SELECT(INDEX_FIRST_TRUE('POL_children'="Si"),0.307,0.0))"""
print(re.findall(r"""'([^"']*)'""", s))
# => ['Loss_Ratio', 'POL_Zero', 'POL_children', 'Loss_Ratio', 'POL_Zero', 'POL_children']
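
The question also deduplicates the matches with list(set(...)), which does not keep the order of appearance. If order matters, a possible sketch is to deduplicate with dict.fromkeys instead. The helper name extract_names and the shortened sample expression are made up for illustration.

```python
import re

def extract_names(expr):
    """Return the unique single-quoted names in expr, in first-seen order.

    Hypothetical helper; assumes names contain neither ' nor ".
    """
    names = re.findall(r"""'([^"']*)'""", expr)
    # dict keys keep insertion order (Python 3.7+), so this drops repeats
    # without shuffling the names the way set() can.
    return list(dict.fromkeys(names))

# Shortened, made-up expression in the same style as the question's strings.
sample = r"""LN('Loss_Ratio') + SELECT('POL_Zero'="Nos' conditional") + LN('Loss_Ratio')"""
print(extract_names(sample))
# => ['Loss_Ratio', 'POL_Zero']
```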

Upvotes: 1
