Reputation: 4248
I am trying to remove the outer pair of quotes from each quoted segment of a string. Example:
"hello", how 'are "you" today'
should return
hello, how are "you" today
I am using PHP's preg_replace.
I've got a couple of solutions at the moment:
(\'|")(.*)\1
The problem is that (.*) is greedy and matches every character in the middle (including quotes), so the result ($2) is
hello", how 'are "you today'
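For reference, a minimal sketch that reproduces this behaviour, assuming the pattern is passed to preg_replace with $2 as the replacement (the variable names are only illustrative):

```php
<?php
// Input string from the example above.
$text = '"hello", how \'are "you" today\'';

// First attempt: (.*) is greedy, so it runs on to the LAST quote that
// matches the backreference \1 and swallows the quotes in between.
$result = preg_replace('/(\'|")(.*)\1/', '$2', $text);

echo $result, PHP_EOL;
// hello", how 'are "you today'
```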
Backreferences cannot be used in character classes, so I can't use something like
(\'|")([^\1\r\n]*)\1
to not match the first backreference in the middle.
Second solution:
(\'[^\']*\'|"[^"]*")
The problem is that the capturing group includes the quotes themselves, so replacing with $1 doesn't change anything at all. The result ($1):
"hello", how 'are "you" today'
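The same thing as a small runnable sketch (again assuming preg_replace, with $1 as the replacement), showing that the output is identical to the input:

```php
<?php
$text = '"hello", how \'are "you" today\'';

// Second attempt: the single capturing group wraps each whole quoted
// segment, quotes included, so replacing a match with $1 puts back
// exactly what was matched.
$result = preg_replace('/(\'[^\']*\'|"[^"]*")/', '$1', $text);

var_dump($result === $text);
// bool(true)
```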
Upvotes: 3
Views: 527
Reputation: 4166
Regarding:
Backreferences cannot be used in character classes, so I can't use something like
(\'|")([^\1\r\n]*)\1
(\'|")(((?!(\1|\r|\n)).)*)\1
(where (?!...) is a negative lookahead for ...) should work.
I don't know whether this solves your main problem, but it does solve the "match a character iff it doesn't match a backref" part.
Missed a parenthesis, fixed.
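As a quick check, here is a minimal sketch that plugs the lookahead pattern into preg_replace. It assumes $2 (the content between the quotes) is the replacement and uses the sample string from the question; on that input it appears to give the output the question asks for.

```php
<?php
$text = '"hello", how \'are "you" today\'';

// Group 1: the opening quote. Group 2: every following character that
// is not that same quote (nor CR/LF), tested one at a time by the
// negative lookahead before the dot consumes it.
$pattern = '/(\'|")(((?!(\1|\r|\n)).)*)\1/';

$result = preg_replace($pattern, '$2', $text);

echo $result, PHP_EOL;
// hello, how are "you" today
```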
Upvotes: 2
Reputation: 383736
Instead of:
(\'[^\']*\'|"[^"]*")
Simply write:
\'([^\']*)\'|"([^"]*)"
  \______/    \_____/
     1           2
Now one of the groups will match the quoted content.
In most flavors, when a group that failed to match is referred to in a replacement string, the empty string gets substituted in, so you can simply replace with $1$2: one will be the successful capture (depending on the alternate) and the other will substitute in the empty string.
Here's a PHP implementation (as seen on ideone.com):
$text = <<<EOT
"hello", how 'are "you" today'
EOT;
print preg_replace(
'/\'([^\']*)\'|"([^"]*)"/',
'$1$2',
$text
);
# hello, how are "you" today
Let's use 1 and 2 for the quotes (for clarity). Whitespace will also be added (for clarity).
Before, you have, as your second solution, this pattern:
( 1[^1]*1 | 2[^2]*2 )
\___________________/
 capture whole thing
 content and quotes
As you correctly pointed out, this matches a pair of quotes correctly (assuming that you can't escape quotes), but it doesn't capture the content part on its own.
This may not be a problem depending on context (e.g. you can simply trim one character from the beginning and end to get the content), but at the same time, it's also not that hard to fix the problem: simply capture the content from the two possibilities separately.
1([^1]*)1 | 2([^2]*)2
 \_____/     \_____/
 capture contents from
each alternate separately
Now either group 1 or group 2 will capture the content, depending on which alternate was matched. As a "bonus", you can check which quote was used, i.e. if group 1 succeeded, then 1 was used.
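A minimal sketch of that "bonus", assuming preg_match_all with PREG_SET_ORDER is used to collect the matches. Rather than relying on how unmatched groups are reported, this version reads the quote from the first character of the whole match and then picks the corresponding group; the $found array is only an illustrative name.

```php
<?php
$text = '"hello", how \'are "you" today\'';

preg_match_all('/\'([^\']*)\'|"([^"]*)"/', $text, $matches, PREG_SET_ORDER);

$found = [];
foreach ($matches as $m) {
    // $m[0] is the whole match, so its first character is the quote used.
    $quote = $m[0][0];
    // Single quotes fill group 1, double quotes fill group 2.
    $content = ($quote === "'") ? $m[1] : $m[2];
    $found[] = [$quote, $content];
}

foreach ($found as $pair) {
    echo $pair[0], ' => ', $pair[1], PHP_EOL;
}
// " => hello
// ' => are "you" today
```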
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
(…) is used for grouping. (pattern) is a capturing group and creates a backreference. (?:pattern) is non-capturing.
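A small sketch of these constructs with preg_match; the sample words and variable names are made up purely for illustration:

```php
<?php
// Negated character class: the first run of characters that are NOT
// lowercase vowels.
preg_match('/[^aeiou]+/', 'audio', $negated);

// Capturing group: the matched vowel is stored as group 1.
preg_match('/h([aeiou])t/', 'hat', $capturing);

// Non-capturing group: grouping only, no extra entry is stored.
preg_match('/h(?:[aeiou])t/', 'hat', $nonCapturing);

print_r($negated);       // [0] => d
print_r($capturing);     // [0] => hat, [1] => a
print_r($nonCapturing);  // [0] => hat
```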
Upvotes: 3
Reputation: 7997
You cannot do this with a regular expression. This requires an internal state to keep track of (among other things)
This requires a grammar-aware parser to do correctly. A regular expression engine does not keep such state because it is a finite-state automaton, which only operates on the current input regardless of what came before.
It's the same reason you cannot reliably match sets of nested parentheses or XML elements.
Upvotes: 0