Reputation: 26218
I am looking for a regex (preferably in R) that replaces every occurrence of a specific character, say ;, with, say, ;;, but only when that character is not inside parentheses () in the text string.
Note:
1. There may be more than one occurrence of the character inside the parentheses too.
2. There are no nested parentheses in the data/vector.
Example:
text;othertext should be replaced with text;;othertext
text;other(texttt;some;someother);more should be replaced with text;;other(texttt;some;someother);;more
(i.e. only a ; outside () is replaced with the replacement text). If some clarification is still needed, I will try to explain.
in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag", "zvc;dfasdf;asdga;asd(asd;hsfd)", "adsg;(asdg;ASF;DFG;ASDF;);sdafdf", "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")
in_vec
#> [1] "abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag"
#> [2] "zvc;dfasdf;asdga;asd(asd;hsfd)"
#> [3] "adsg;(asdg;ASF;DFG;ASDF;);sdafdf"
#> [4] "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa"
Expected output (calculated manually)
[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
Upvotes: 7
Views: 526
Reputation: 18611
Use the following in case of no nested parentheses:
gsub("\\([^()]*\\)(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
  \(                       '(' char
--------------------------------------------------------------------------------
  [^()]*                   any character except: '(', ')' (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \)                       ')' char
--------------------------------------------------------------------------------
  (*SKIP)(*FAIL)           skip current match, search for new one from here
--------------------------------------------------------------------------------
  |                        OR
--------------------------------------------------------------------------------
  ;                        ';'
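As a quick check of how the two alternatives interact, here is a minimal sketch that runs the same pattern on two made-up strings (they are not from the question). The second string has an unclosed parenthesis, which is an assumption added here to show that nothing is skipped when there is no matching ).

```r
# Made-up inputs for illustration: one balanced, one with an unclosed "("
x <- c("a;b(c;d);e", "a;(b;c")

# Same pattern as above: skip any complete "(...)" group, otherwise match ";"
out <- gsub("\\([^()]*\\)(*SKIP)(*FAIL)|;", ";;", x, perl = TRUE)
print(out)
#> [1] "a;;b(c;d);;e" "a;;(b;;c"
```

In the second string the ; after the unclosed ( is doubled as well, so this approach relies on the parentheses being balanced.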
If there are nested parentheses:
gsub("(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \(                       '('
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      [^()]++                  any character except: '(', ')' (1 or more times
                               (matching the most amount possible, no backtracking))
--------------------------------------------------------------------------------
     |                        or
--------------------------------------------------------------------------------
      (?1)                     recursing first group pattern
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    \)                       ')'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (*SKIP)(*FAIL)           skip the match, search for next
--------------------------------------------------------------------------------
  |                        or
--------------------------------------------------------------------------------
  ;                        ';'
--------------------------------------------------------------------------------
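To see why the recursive variant matters, the sketch below compares both patterns on a made-up nested string (an assumption for illustration; the question's data has no nesting). The variable names are hypothetical.

```r
# Made-up input with one level of nesting
nested <- "a;(b;(c;d);e);f"

flat_pattern      <- "\\([^()]*\\)(*SKIP)(*FAIL)|;"
recursive_pattern <- "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;"

# The non-recursive pattern only skips the innermost "(c;d)" group
flat_out <- gsub(flat_pattern, ";;", nested, perl = TRUE)
# The recursive pattern skips the whole balanced "(b;(c;d);e)" group
recursive_out <- gsub(recursive_pattern, ";;", nested, perl = TRUE)

print(flat_out)
#> [1] "a;;(b;;(c;d);;e);;f"
print(recursive_out)
#> [1] "a;;(b;(c;d);e);;f"
```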
Upvotes: 1
Reputation: 41
Though the problem can be tackled with regex, using a simple function might be more straightforward and easier to understand.
replace_semicolons_outside_parentheses <- function(raw_string) {
  # Replace ; with ;; outside of parentheses
  processed_string <- ""
  n_open_parentheses <- 0
  # Loop over the characters in raw_string
  for (char in strsplit(raw_string, "")[[1]]) {
    # Update the net number of open parentheses
    if (char == "(") {
      n_open_parentheses <- n_open_parentheses + 1
    } else if (char == ")") {
      n_open_parentheses <- n_open_parentheses - 1
    }
    # Replace ; with ;; outside of parentheses
    if (char == ";" && n_open_parentheses == 0) {
      processed_string <- paste0(processed_string, ";;")
    } else {
      processed_string <- paste0(processed_string, char)
    }
  }
  return(processed_string)
}
Note that the function above also works for nested parentheses: no semicolons inside nested parentheses are replaced! The desired output can be obtained in a single line:
out_vec <- lapply(in_vec, replace_semicolons_outside_parentheses)
# 1. 'abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag'
# 2. 'zvc;;dfasdf;;asdga;;asd(asd;hsfd)'
# 3. 'adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf'
# 4. 'asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa'
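Since lapply returns a list, a plain character vector may be more convenient in practice. The following is a minimal sketch of one possible vectorised variant of the same depth-counting idea; the function name replace_outside and its arguments char and replacement are hypothetical, and it assumes char is a single character.

```r
# Hypothetical helper: same counting idea, using cumsum() instead of a loop
replace_outside <- function(x, char = ";", replacement = ";;") {
  vapply(x, function(s) {
    chars <- strsplit(s, "")[[1]]
    # Net number of open parentheses at each character position
    depth <- cumsum(chars == "(") - cumsum(chars == ")")
    # Replace only the target characters that sit at depth 0
    chars[chars == char & depth == 0] <- replacement
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# First string from the question plus a made-up nested one
res <- replace_outside(c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag",
                         "a;(b;(c;d);e);f"))
print(res)
#> [1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#> [2] "a;;(b;(c;d);e);;f"
```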
Upvotes: 4
Reputation: 39647
You can use gsub with ;(?![^(]*\\)):
gsub(";(?![^(]*\\))", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
Here ; finds a ;, (?!...) is a negative lookahead (the replacement is made when its content does not match), [^(] matches everything except (, * repeats the previous item 0 to n times, and \\) means followed by ).
Or
gsub(";(?=[^)]*($|\\())", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
Here ; finds a ;, (?=...) is a positive lookahead (the replacement is made when its content does match), [^)] matches everything except ), * repeats the previous item 0 to n times, and ($|\\() matches either the end of the string $ or a (.
Or use gregexpr and regmatches to extract the parts between ( and ) and make the replacement in the non-matched substrings:
x <- gregexpr("\\(.*?\\)", in_vec) #Find the part between ( and )
mapply(function(a, b) {
paste(matrix(c(gsub(";", ";;", b), a, ""), 2, byrow=TRUE), collapse = "")
}, regmatches(in_vec, x), regmatches(in_vec, x, TRUE))
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
But all of them will work only for simple, non-nested combinations of an opening ( and a closing ).
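To illustrate that limitation, here is a minimal sketch applying the two lookahead patterns to a made-up nested string (an assumption for illustration, since the question's data has no nesting). The variable names are hypothetical.

```r
# Made-up input with one level of nesting
nested <- "a;(b;(c;d);e);f"

lookahead_neg <- gsub(";(?![^(]*\\))", ";;", nested, perl = TRUE)
lookahead_pos <- gsub(";(?=[^)]*($|\\())", ";;", nested, perl = TRUE)

print(lookahead_neg)
#> [1] "a;;(b;;(c;d);e);;f"
print(lookahead_pos)
#> [1] "a;;(b;;(c;d);e);;f"
```

In both cases the ; between b and the inner ( is doubled even though it sits inside the outer parentheses, because the lookahead only inspects the next parenthesis character and does not track depth.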
Upvotes: 10