SpikETidE

Reputation: 6941

Python Regex - Match a character without consuming it

I would like to convert the following string

"For "The" Win","Way "To" Go"

to

"For ""The"" Win","Way ""To"" Go"

The straightforward regex would be

str2 = re.sub(r'(?<!,|^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)

That is, double the quotes that are either:

  1. Followed by a letter but not preceded by a comma or the beginning of a line
  2. Preceded by a letter but not followed by a comma or the end of a line

The problem is that I am using Python, and its regex engine does not accept this alternation inside the lookbehind, because the two alternatives have different widths. I get the error:

sre_constants.error: look-behind requires fixed-width pattern
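For readers who want to reproduce the error, here is a minimal sketch. The helper name compile_error is only for illustration; on Python 3 the exception is exposed as re.error rather than sre_constants.error.

```python
import re

# The pattern from the question. Inside (?<!,|^) the alternatives have
# widths 1 (the comma) and 0 (the anchor), which the engine rejects.
PATTERN = r'(?<!,|^)"(?=\w)|(?<=\w)"(?!,|$)'

def compile_error(pattern):
    """Return the message raised by re.compile, or None if it compiles."""
    try:
        re.compile(pattern, flags=re.MULTILINE)
    except re.error as exc:
        return str(exc)
    return None

message = compile_error(PATTERN)
print(message)

# A look-behind whose alternatives all have the same width compiles fine.
print(compile_error(r'(?<!,|;)"'))
```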

What I am looking for is a regex that will replace the '"' around 'The' and 'To' with '""'. I can use the following regex (an answer provided to another question):

\b\s*"(?!,|[ \t]*$)

but that consumes the space just before 'The' and 'To', and I get the following:

"For""The"" Win","Way""To"" Go"

Is there a workaround so that I can double the quotes around 'The' and 'To' without consuming the spaces just before them?

Upvotes: 2

Views: 3178

Answers (5)

Alan Moore

Reputation: 75222

Looks to me like you don't need to bother with anchors.

  • If there is a character before the quote, you know it's not at the beginning of the string.
  • If that character is not a newline, you're not at the beginning of a line.
  • If the character is not a comma, you're not at the beginning of a field.

So you don't need to use anchors, just do a positive lookbehind/lookahead for a single character:

result = re.sub(r'(?<=[^",\r\n])"(?=[^,"\r\n])', '""', subject)

I threw the " into both character classes on the chance that some quotes are already escaped. But realistically, if that's the case you're probably screwed anyway. ;)
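A quick way to check this pattern against the sample from the question is sketched below. The names INNER_QUOTE and double_inner_quotes are made up for the example; the regex itself is the one given above.

```python
import re

# Pattern from the answer: a quote with one "ordinary" character on each side,
# i.e. not a quote, comma, or line break.
INNER_QUOTE = re.compile(r'(?<=[^",\r\n])"(?=[^,"\r\n])')

def double_inner_quotes(text):
    """Double every quote that is not at a field, line, or string boundary."""
    return INNER_QUOTE.sub('""', text)

subject = '"For "The" Win","Way "To" Go"'
result = double_inner_quotes(subject)
print(result)  # "For ""The"" Win","Way ""To"" Go"

# Quotes that are already doubled are skipped, so a second pass changes nothing.
print(double_inner_quotes(result) == result)  # True
```

Because both character classes exclude the quote itself, running the substitution twice should be harmless on input like this.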

Upvotes: 2

eyquem

Reputation: 27575

str2 = re.sub('(?<=[^,])"(?=\w)'
              '|'
              '(?<=\w)"(?!,|$)',

              '""',  ss,
              flags=re.MULTILINE)

I always wonder why people use raw strings for regex patterns when it isn't needed.

Note that I changed your str, which is the name of a built-in class, to ss.

For "fun":

str2 = re.sub('"'
              '('
              '(?<=[^,]")(?=\w)'
              '|'
              '(?<=\w")(?!,|$)'
              ')',

              '""', ss,
              flags=re.MULTILINE)

or also

str2 = re.sub('(?<=[^,]")(?=\w)'
              '|'
              '(?<=\w")(?!,|$)',

              '"',  ss,
              flags=re.MULTILINE)

Upvotes: 0

roippi

Reputation: 25954

Most direct workaround whenever you encounter this issue: explode the look-behind into two look-behinds.

str2 = re.sub(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)

(don't name your strings str)
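As a rough check of the split look-behind, the sketch below runs it on the sample string and on a two-line variant. The names SPLIT_LOOKBEHIND and double_quotes_split are illustrative only, and the two-line input is an assumption about what the MULTILINE flag is meant to handle.

```python
import re

# (?<!,|^) is rejected, but two consecutive negative look-behinds
# express the same "preceded by neither" condition.
SPLIT_LOOKBEHIND = re.compile(
    r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)',
    flags=re.MULTILINE,
)

def double_quotes_split(text):
    """Apply the split look-behind pattern, doubling inner quotes."""
    return SPLIT_LOOKBEHIND.sub('""', text)

single = double_quotes_split('"For "The" Win","Way "To" Go"')
print(single)  # "For ""The"" Win","Way ""To"" Go"

# With MULTILINE, ^ and $ also protect the quotes at each line boundary.
two_lines = double_quotes_split('"For "The" Win"\n"Way "To" Go"')
print(two_lines)
```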

Upvotes: 1

Markus Jarderot

Reputation: 89171

re.sub(r'\b(\s*)"(?!,|[ \t]*$)', r'\1""', s)
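The idea here appears to be that the whitespace is still consumed, but it is captured in group 1 and written back by the replacement. A small side-by-side sketch (the variable names are only for illustration):

```python
import re

s = '"For "The" Win","Way "To" Go"'

# Pattern from the question: the \s* is matched and thrown away.
consuming = re.sub(r'\b\s*"(?!,|[ \t]*$)', '""', s)
print(consuming)  # "For""The"" Win","Way""To"" Go"

# Same pattern with the whitespace captured and restored via \1.
capturing = re.sub(r'\b(\s*)"(?!,|[ \t]*$)', r'\1""', s)
print(capturing)  # "For ""The"" Win","Way ""To"" Go"
```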

Upvotes: 1

perreal

Reputation: 97918

Instead of saying not preceded by comma or the line start, say preceded by a non-comma character:

r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)'
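One caveat, sketched below under the assumption that the input may span several lines: a newline is also a non-comma character, so the opening quote of a later line can be doubled as well. Excluding the newline from the character class, as in the hypothetical NO_COMMA_OR_NEWLINE variant here, seems to avoid that; the variant is not part of the answer above.

```python
import re

# Pattern from the answer, with the MULTILINE flag used in the question.
NON_COMMA = re.compile(r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)', flags=re.MULTILINE)

# Hypothetical variant: also refuse a newline before the quote.
NO_COMMA_OR_NEWLINE = re.compile(
    r'(?<=[^,\n])"(?=\w)|(?<=\w)"(?!,|$)', flags=re.MULTILINE
)

one_line = '"For "The" Win","Way "To" Go"'
print(NON_COMMA.sub('""', one_line))  # "For ""The"" Win","Way ""To"" Go"

two_lines = '"For "The" Win"\n"Way "To" Go"'
as_given = NON_COMMA.sub('""', two_lines)
adjusted = NO_COMMA_OR_NEWLINE.sub('""', two_lines)
print(as_given)   # the second line starts with a doubled quote
print(adjusted)   # line-leading quotes are left alone
```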

Upvotes: 2
