r regex gsub

M--

Reputation: 29238

Remove non-numeric characters within parentheses

I am looking to remove the non-numeric characters within a certain pair of parentheses, and to remove the other parentheses on that line. See the example below:

text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)
John Doe (John.Doe@company22.com)",
          "1110383 Project something 11/22/2019 ASP (890212-wso)
John Doe (John.Doe@company22.com)
Other Stuff",
          "1110383 Project something SD (890212)
John Doe (John.Doe@company22.com)")

The expected output would be:

cat(paste0(myoutxt, collapse = "\n"))
# 1110383 Project something 11/22/2019 WSO (89021)
# John Doe (John.Doe@company22.com)
# 1110383 Project something 11/22/2019 ASP (890212)
# John Doe (John.Doe@company22.com)
# 1110383 Project something SD (890212)
# John Doe (John.Doe@company22.com)

I came up with a regex that matches my 5 or 6 digit number, but I am not sure what the replacement should be. I also think the following pattern needs to be modified, since it does not account for other parentheses on the line that should be removed.

^.*?\\([^\\d]*(\\d{5,6})[^\\d]*\\).*$

Logic:

Basically, I am looking to find the line with a 5-6 digit number (e.g. 89021 or 890212) between parentheses. Then, if there is anything else within those parentheses (e.g. -design or -wso), I want to remove it. Lastly, if there are other parentheses on that specific line (e.g. (WSO)), I want the parentheses, but not the word inside them, to be removed.

Upvotes: 2

Answers (3)

bobble bubble

Reputation: 18555

How about substituting

(?:\(([^)\d]+)\)(.*?))?\([^\d)]*(\d{5,6})[^\d)]*\)

with

$1$2($3)

(?:\(([^)\d]+)\)(.*?))? the first, optional part captures any preceding parenthesized stuff to $1. Anything that might follow before the parenthesized 5-6 digit part is captured to $2
\([^\d)]*(\d{5,6})[^\d)]*\) the second part captures the 5-6 digits to $3
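
To see what each group captures, here is a small sketch in base R. It assumes a single-line input modelled on the first record from the question; the names `pattern`, `line` and `groups` are only illustrative and not part of the answer.

```r
# The pattern described above, written as an R string (backslashes doubled).
pattern <- "(?:\\(([^)\\d]+)\\)(.*?))?\\([^\\d)]*(\\d{5,6})[^\\d)]*\\)"

# Hypothetical single-line input based on the question's first record.
line <- "1110383 Project something 11/22/2019 (WSO) (89021-design)"

# regexec() returns the full match followed by one entry per capture group.
groups <- regmatches(line, regexec(pattern, line, perl = TRUE))[[1]]

groups[1]  # full match: "(WSO) (89021-design)"
groups[2]  # group 1:    "WSO"
groups[3]  # group 2:    " "
groups[4]  # group 3:    "89021"
```

Pasting groups 1 and 2 back together and wrapping group 3 in parentheses gives "WSO (89021)", which is what the replacement does.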

See the demo at regex101

In R, using gsub:

gsub(pattern='(?:\\(([^)\\d]+)\\)(.*?))?\\([^\\d)(]*(\\d{5,6})[^\\d)(]*\\)', 
         replacement='\\1\\2(\\3)', 
         x=text, 
         perl=TRUE, fixed = FALSE)

Upvotes: 1

niko

Reputation: 5281

Here is a lateral approach

fun_0 <- function(string) {
  vec <- strsplit(string, '\\(|\\)', perl = TRUE)[[1L]]
  # strsplit() keeps a leading "" when the string starts with '(',
  # so the bracketed parts always sit at the even positions
  s <- 2L
  e <- length(vec)
  if (s > e)
    return(vec)
  inside_brackets <- seq(s, e, 2L)
  # keep the 5-6 digit number and drop everything else inside the brackets
  vec[inside_brackets] <- gsub('\\D*(\\d{5,6})\\D*', '(\\1)', vec[inside_brackets])
  paste(vec, collapse = '')
}
fun_1 <- function(string_vec) {
  to_process <- grepl('\\d{4,}', string_vec)
  string_vec[to_process] <- vapply(string_vec[to_process], fun_0, character(1))
  paste(string_vec, collapse = '\n')
}
fun_2 <- function(text) {
    string_list <- strsplit(text, '\n')
    vapply(string_list, fun_1, character(1))
}

Examples

text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe (John.Doe@company22.com)",
          "1110383 Project something 11/22/2019 ASP (890212-wso)\nJohn Doe (John.Doe@company22.com)\nOther Stuff",
          "1110383 Project something SD (890212)\nJohn Doe (John.Doe@company22.com)")
fun_2(text)
# [1] "1110383 Project something 11/22/2019 WSO (89021)\nJohn Doe (John.Doe@company22.com)"
# [2] "1110383 Project something 11/22/2019 ASP (890212)\nJohn Doe (John.Doe@company22.com)\nOther Stuff"
# [3] "1110383 Project something SD (890212)\nJohn Doe (John.Doe@company22.com)"

Upvotes: 0

Biblot

Reputation: 705

Is this what you want?

"\\(([^0-9@]*)\\)": Remove the parentheses from any group that doesn't contain a digit or @
"\\((\\d{5,6}).*\\)": For parentheses containing a 5 to 6 digit number followed by anything else, leave only the digits inside the parentheses.

I assumed the other set of parentheses would always contain email addresses.

library(stringr)

cat(
  paste0(
    str_replace(
      str_replace(text, "\\(([^0-9@]*)\\)", "\\1"), 
      "\\((\\d{5,6}).*\\)", 
      "(\\1)"),
    collapse = "\n"
  )
)

# 1110383 Project something 11/22/2019 WSO (89021)
# John Doe (John.Doe@company22.com)
# 1110383 Project something 11/22/2019 ASP (890212)
# John Doe (John.Doe@company22.com)
# Other Stuff
# 1110383 Project something SD (890212)
# John Doe (John.Doe@company22.com)
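
If you would rather not load stringr, the same two-step idea can be sketched in base R with sub(), which also replaces only the first match. This is a rough equivalent and not the code above: it assumes the sample input from the question, and it swaps `.*` for `[^)]*` so that the match cannot run past the closing parenthesis onto the next line. The names `step1` and `result` are only illustrative.

```r
# Sample input copied from the question.
text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe (John.Doe@company22.com)",
          "1110383 Project something 11/22/2019 ASP (890212-wso)\nJohn Doe (John.Doe@company22.com)\nOther Stuff",
          "1110383 Project something SD (890212)\nJohn Doe (John.Doe@company22.com)")

# Step 1: drop the parentheses around the first group that has no digit and no "@".
step1 <- sub("\\(([^0-9@]*)\\)", "\\1", text)

# Step 2: inside the first group that starts with 5-6 digits, keep only the digits
# and put the parentheses back around them.
result <- sub("\\(([0-9]{5,6})[^)]*\\)", "(\\1)", step1)

cat(result, sep = "\n")
```

As with the stringr version, this relies on the email address being the only other parenthesized text, since the "@" is what protects it in step 1.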

Upvotes: 0
