Reputation: 537
I'm struggling to write a regex that matches all double-quote characters (") that occur within square brackets.
I have different pieces that do parts of what I want. For example,
gsub('"', "", '"""xyz"""')
[1] "xyz"
Will get all double-quotes, irrespective of anything else.
gsub('\\[(.*?)\\]', "", '[xyz][][][]abc')
[1] "abc"
Will get everything inside two square brackets, including the brackets themselves (which I do not want to happen -- how do I avoid that?).
I'm also not sure how to combine the two once I have them each working. Here's an example of the desired behavior. Given an input string ["cats", "dogs"]"x", I want an expression that will replace the four " characters inside of the square brackets, but not the ones outside.
To be more specific:
gsub('THE_REGEX', "", '["cats", "dogs"]"x"')
should return
[cats, dogs]"x"
I want to remove double-quotes when they occur inside of square brackets, but not when they occur outside of square brackets.
Upvotes: 2
Views: 1026
Reputation: 89584
A \G-based pattern ensures contiguity between matches and that you are always between square brackets:
gsub('(?:\\G(?!\\A)|\\[)[^]"]*\\K"', "", '["cats", "dogs"]"x"', perl=TRUE)
Or if you want to check that the closing square bracket exists:
gsub('(?:\\G(?!\\A)|\\[(?=[^][]*]))[^]"]*\\K"', "", '["cats", "dogs"]"x"', perl=TRUE)
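To see what the extra lookahead changes, here is a small comparison of the two patterns on a string whose opening bracket is never closed. The input and the names `loose` and `strict` are made up for illustration:

```r
# Illustrative names, not part of the original patterns
loose  <- '(?:\\G(?!\\A)|\\[)[^]"]*\\K"'
strict <- '(?:\\G(?!\\A)|\\[(?=[^][]*]))[^]"]*\\K"'

unclosed <- '["cats", "dogs"'   # hypothetical input with no closing bracket

# Without the lookahead, quotes after the stray "[" are still removed
res_loose <- gsub(loose, "", unclosed, perl = TRUE)

# With the lookahead, "[" only counts when a "]" follows, so nothing changes
res_strict <- gsub(strict, "", unclosed, perl = TRUE)

print(res_loose)
print(res_strict)
```

Which behavior is preferable depends on whether unbalanced brackets can occur in the data at all.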
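The \G anchor matches at the position where the previous match ended, which is why it can be used to ensure contiguity between matches. On the very first attempt it matches at the start of the string, which is why it is followed by (?!\A) here.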
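A tiny illustration of that contiguity, separate from the bracket problem (the input string is made up): with \G in front, only the unbroken run of matches starting at the beginning of the string is replaced.

```r
# Without \G every "a" is replaced
res_all <- gsub('a', 'X', 'aaabaa', perl = TRUE)
print(res_all)    # "XXXbXX"

# With \G each match must start where the previous one ended,
# so the chain breaks at "b" and the trailing "aa" is left alone
res_chain <- gsub('\\Ga', 'X', 'aaabaa', perl = TRUE)
print(res_chain)  # "XXXbaa"
```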
Both patterns start with an alternation. One branch (the second one) is used for the first match and finds the opening square bracket; then [^]"]* consumes every character up to the next quote or closing square bracket. \K marks the position from which the matched characters are reported in the match result (that's why everything that comes before it isn't erased). The other branch, which starts with \G, is used for the following matches (only immediately after the previous one). Since [^]"]* forbids the closing square bracket, you can't get out of the square brackets. When there are no more quotes to replace, the pattern fails; the regex engine then moves on to the next character, and so on, until the second branch succeeds again (if an opening square bracket is found).
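As a quick check of that walk-through, the first pattern can be run on an input with two bracketed groups and quoted text in between (the input is invented for the example). The quotes between the groups should survive, because the \G chain breaks at each ] and only restarts at the next [:

```r
pattern <- '(?:\\G(?!\\A)|\\[)[^]"]*\\K"'

# Hypothetical input: quoted text outside and inside two bracket groups
input <- '"a" ["b", "c"] "d" ["e"]'

result <- gsub(pattern, "", input, perl = TRUE)
print(result)
```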
Note: even though this approach doesn't need a dependency, keep in mind that it is by far less simple to understand than applying a callback function to a match of the complete content between brackets, as Grothendieck does.
Regarding the two edge cases in my comment, I think the best solution is to keep quotes that contain a closing square bracket when they are inside square brackets: https://regex101.com/r/SOMpqN/1
Upvotes: 2
Reputation: 269885
Using gsubfn, search for [...] and then pass each match to the indicated gsub function. Everything outside the match will be left as is. The input s is defined under "Test data" below.
library(gsubfn)
gsubfn('\\[.*?\\]', ~ gsub('"', '', x), s)
## [1] "\"abc\" [cats, dogs] \"def\"" "\"abc\" [cats, dogs] \"def\""
Test data:
s <- '"abc" ["cats", "dogs"] "def"'
s <- c(s, s)
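If adding a package is not an option, roughly the same callback idea can be sketched in base R with gregexpr() and the replacement form of regmatches(). This is a swapped-in alternative rather than anything gsubfn does internally, and the names `m`, `cleaned` and `out` are illustrative:

```r
s <- '"abc" ["cats", "dogs"] "def"'
s <- c(s, s)

# Locate every [...] span in each element of s
m <- gregexpr('\\[.*?\\]', s, perl = TRUE)

# Strip the quotes inside each matched span only
cleaned <- lapply(regmatches(s, m), function(x) gsub('"', '', x, fixed = TRUE))

# Write the cleaned spans back; text outside the matches is untouched
out <- s
regmatches(out, m) <- cleaned
print(out)
```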
Upvotes: 3