Reputation: 29238
I am looking to remove the non-numeric characters within one particular set of parentheses, and to remove any other parentheses on that line. See the example below:
text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)
John Doe ([email protected])",
"1110383 Project something 11/22/2019 ASP (890212-wso)
John Doe ([email protected])
Other Stuff",
"1110383 Project something SD (890212)
John Doe ([email protected])")
The expected output would be:
cat(paste0(myoutxt, collapse = "\n"))
# 1110383 Project something 11/22/2019 WSO (89021)
# John Doe ([email protected])
# 1110383 Project something 11/22/2019 ASP (890212)
# John Doe ([email protected])
# 1110383 Project something SD (890212)
# John Doe ([email protected])
I came up with a regex that matches my 5- or 6-digit number, but I am not sure what the replacement should be. I also think the pattern needs to change, since it does not account for other parentheses on the line that should be removed.
^.*?\\([^\\d]*(\\d{5,6})[^\\d]*\\).*$
Basically, I am looking to find the line with a 5-6 digit number (e.g. 89021 or 890212) between parentheses. Then, if there is anything else within those parentheses, I want to remove it (e.g. -design or -wso). Lastly, if there are other parentheses on that specific line (e.g. (WSO)), I want the parentheses, and not the word, to be removed.
Upvotes: 2
Views: 213
Reputation: 18555
How about replacing
(?:\(([^)\d]+)\)(.*?))?\([^\d)]*(\d{5,6})[^\d)]*\)
with
$1$2($3)
The first, optional part, (?:\(([^)\d]+)\)(.*?))?, captures any preceding parenthesized text that contains no digits to $1. Anything that follows it before the parenthesized 5-6 digit part is captured to $2.
The second part, \([^\d)]*(\d{5,6})[^\d)]*\), captures the 5-6 digits to $3.
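To make the three groups concrete, here is a small sketch that inspects them with base R's regexec() and regmatches(). The variable names (pattern, line, groups) are only for illustration, and the input is the first line of the question's first example.

```r
# Pattern from the answer, with backslashes doubled for an R string literal
pattern <- "(?:\\(([^)\\d]+)\\)(.*?))?\\([^\\d)]*(\\d{5,6})[^\\d)]*\\)"
line <- "1110383 Project something 11/22/2019 (WSO) (89021-design)"

# regexec() returns the full match followed by each capture group
groups <- regmatches(line, regexec(pattern, line, perl = TRUE))[[1]]
print(groups)
```

On this input the four elements should be the full match "(WSO) (89021-design)", then "WSO", a single space, and "89021".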
In R, using gsub:
gsub(pattern='(?:\\(([^)\\d]+)\\)(.*?))?\\([^\\d)(]*(\\d{5,6})[^\\d)(]*\\)',
replacement='\\1\\2(\\3)',
x=text,
perl=TRUE, fixed = FALSE)
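As a quick check, the same call can be tried on a single two-line element. This is only a sketch: the email address below is a made-up placeholder, since the real ones are not shown in the question, and x and out are illustrative names.

```r
# Hypothetical input: the email address is a made-up placeholder
x <- "1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe (jdoe@example.com)"

out <- gsub(pattern = "(?:\\(([^)\\d]+)\\)(.*?))?\\([^\\d)(]*(\\d{5,6})[^\\d)(]*\\)",
            replacement = "\\1\\2(\\3)",
            x = x,
            perl = TRUE)

# The parentheses around WSO are dropped, "-design" is removed,
# and the email line is left alone because it contains no 5-6 digit run
cat(out, "\n", sep = "")
```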
Upvotes: 1
Reputation: 5281
Here is a lateral approach
fun_0 <- function(string) {
  # Split on "(" and ")": the pieces alternate between outside and inside parentheses
  vec <- strsplit(string, '\\(|\\)', perl = TRUE)[[1L]]
  # strsplit() yields a leading "" when the string starts with "(",
  # so the first parenthesized piece is always at index 2
  s <- 2L
  e <- length(vec)
  if (s > e)
    return(vec)
  inside_brackets <- seq(s, e, 2L)
  # Keep only a 5-6 digit run and re-wrap it; other pieces lose their parentheses
  vec[inside_brackets] <- gsub('\\D*(\\d{5,6})\\D*', '(\\1)', vec[inside_brackets])
  paste(vec, collapse = '')
}
fun_1 <- function(string_vec) {
  # Only process lines that contain a run of at least 5 digits
  to_process <- grepl('\\d{5,}', string_vec)
  string_vec[to_process] <- vapply(string_vec[to_process], fun_0, character(1))
  paste(string_vec, collapse = '\n')
}
fun_2 <- function(text) {
  # Handle each element line by line, then glue the lines back together
  string_list <- strsplit(text, '\n')
  vapply(string_list, fun_1, character(1))
}
Examples
text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe ([email protected])",
"1110383 Project something 11/22/2019 ASP (890212-wso)\nJohn Doe ([email protected])\nOther Stuff",
"1110383 Project something SD (890212)\nJohn Doe ([email protected])")
fun_2(text)
# [1] "1110383 Project something 11/22/2019 WSO (89021)\nJohn Doe ([email protected])"
# [2] "1110383 Project something 11/22/2019 ASP (890212)\nJohn Doe ([email protected])\nOther Stuff"
# [3] "1110383 Project something SD (890212)\nJohn Doe ([email protected])"
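The approach relies on how strsplit() lays out the pieces when splitting on parentheses, so a short sketch of that behaviour may help. The inputs and the names pieces and leading are made up for illustration.

```r
# Splitting on "(" or ")" alternates between text outside and inside parentheses
pieces <- strsplit("Project (WSO) (89021-design)", "\\(|\\)", perl = TRUE)[[1]]
print(pieces)

# When the string starts with "(", strsplit() emits an empty first element,
# so the parenthesized pieces still sit at the even positions
leading <- strsplit("(WSO) Project", "\\(|\\)", perl = TRUE)[[1]]
print(leading)
```

Note that a trailing match does not produce an empty last element, which is why the first call should return four pieces rather than five.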
Upvotes: 0
Reputation: 705
Is this what you want?
"\\(([^0-9@]*)\\)": removes the parentheses around anything that contains neither a digit nor @.
"\\((\\d{5,6}).*\\)": for parentheses that start with 5 to 6 digits followed by anything else, keeps only the digits. I assumed the other set of parentheses would always contain an email address.
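library(stringr)
cat(
  paste0(
    str_replace(
      # Step 1: drop the parentheses around text with no digits and no "@"
      str_replace(text, "\\(([^0-9@]*)\\)", "\\1"),
      # Step 2: keep only the 5-6 digits, re-wrapped in parentheses
      "\\((\\d{5,6}).*\\)",
      "(\\1)"),
    collapse = "\n"
  )
)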
# 1110383 Project something 11/22/2019 WSO (89021)
# John Doe ([email protected])
# 1110383 Project something 11/22/2019 ASP (890212)
# John Doe ([email protected])
# Other Stuff
# 1110383 Project something SD (890212)
# John Doe ([email protected])
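For readers without stringr, the same two steps can be sketched with base R's sub(), which, like str_replace(), replaces only the first match. This swaps in a different function from the one the answer uses. perl = TRUE is assumed so that . stops at the newline, and the email address is a made-up placeholder.

```r
# Hypothetical input: the email address is a made-up placeholder
x <- "1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe (jdoe@example.com)"

# Step 1: drop the parentheses around text with no digits and no "@"
step1 <- sub("\\(([^0-9@]*)\\)", "\\1", x, perl = TRUE)

# Step 2: keep only the 5-6 digits and put the parentheses back around them
step2 <- sub("\\((\\d{5,6}).*\\)", "(\\1)", step1, perl = TRUE)

cat(step2, "\n", sep = "")
```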
Upvotes: 0