M--

Reputation: 29238

Remove non-numeric characters within parentheses

I want to remove the non-numeric characters inside one particular pair of parentheses, and remove any other parentheses on that line. See the example below:

text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)
John Doe ([email protected])",
          "1110383 Project something 11/22/2019 ASP (890212-wso)
John Doe ([email protected])
Other Stuff",
          "1110383 Project something SD (890212)
John Doe ([email protected])")

The expected output would be:

cat(paste0(myoutxt, collapse = "\n"))
# 1110383 Project something 11/22/2019 WSO (89021)
# John Doe ([email protected])
# 1110383 Project something 11/22/2019 ASP (890212)
# John Doe ([email protected])
# 1110383 Project something SD (890212)
# John Doe ([email protected])

I came up with a regex that matches my 5- or 6-digit number, but I am not sure what the replacement should be. I also think it needs to be modified, since it doesn't account for other parentheses that may be present and need removing.

^.*?\\([^\\d]*(\\d{5,6})[^\\d]*\\).*$

Logic:

Basically, I want to find the line that has a 5- or 6-digit number (e.g. 89021 or 890212) between parentheses. Then, if there is anything else inside those parentheses (e.g. -design or -wso), I want to remove it. Lastly, if there are other parentheses on that same line (e.g. (WSO)), I want the parentheses removed but the word kept.

Upvotes: 2

Views: 213

Answers (3)

bobble bubble

Reputation: 18555

How about replacing

(?:\(([^)\d]+)\)(.*?))?\([^\d)]*(\d{5,6})[^\d)]*\)

with

$1$2($3)
  • (?:\(([^)\d]+)\)(.*?))? the first optional part captures any preceding parenthesized stuff to $1. Anything that might follow before the parenthesized 5-6 digit part is captured to $2
  • \([^\d)]*(\d{5,6})[^\d)]*\) the second part captures the 5-6 digits to $3

See the demo at regex101


Using gsub in R:

gsub(pattern = '(?:\\(([^)\\d]+)\\)(.*?))?\\([^\\d)(]*(\\d{5,6})[^\\d)(]*\\)',
     replacement = '\\1\\2(\\3)',
     x = text,
     perl = TRUE, fixed = FALSE)
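As a quick sanity check, the call above can be run against the question's input and compared with the first lines listed in the expected output. This is only a sketch: the `text` vector is copied from the question with explicit `\n` line breaks, and the names `pattern`, `cleaned`, `first_lines` and `expected_first_lines` are illustrative, not part of the original answer.

```r
# Input from the question, with the line breaks written as "\n"
text <- c(
  "1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe ([email protected])",
  "1110383 Project something 11/22/2019 ASP (890212-wso)\nJohn Doe ([email protected])\nOther Stuff",
  "1110383 Project something SD (890212)\nJohn Doe ([email protected])"
)

# Pattern and replacement as given in the answer
pattern <- "(?:\\(([^)\\d]+)\\)(.*?))?\\([^\\d)(]*(\\d{5,6})[^\\d)(]*\\)"
cleaned <- gsub(pattern, "\\1\\2(\\3)", text, perl = TRUE)

# Compare the first line of each element with the question's expected output
expected_first_lines <- c(
  "1110383 Project something 11/22/2019 WSO (89021)",
  "1110383 Project something 11/22/2019 ASP (890212)",
  "1110383 Project something SD (890212)"
)
first_lines <- vapply(strsplit(cleaned, "\n", fixed = TRUE),
                      function(x) x[1L], character(1))
stopifnot(identical(first_lines, expected_first_lines))

cat(cleaned, sep = "\n")
```

When the optional first part does not take part in a match, `\\1` and `\\2` are replaced by empty strings, so lines without a second pair of parentheses only have their number cleaned up.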

Upvotes: 1

niko

Reputation: 5281

Here is a lateral approach

fun_0 <- function(string) {
  vec <- strsplit(string, '\\(|\\)', perl = TRUE)[[1L]]
  s <- 2L  # strsplit() returns a leading "" when the string starts with '(', so bracket contents always sit at even indices
  e <- length(vec)
  if (s > e)
    return(vec)
  inside_brackets <- seq(s, e, 2L)
  vec[inside_brackets] <- gsub('\\D*(\\d{5,6})\\D*', '(\\1)', vec[inside_brackets])  # keep only the 5-6 digit number, re-wrapped in parentheses
  paste(vec, collapse = '')  
}
fun_1 <- function(string_vec) {
  to_process <- grepl('\\d{4,}', string_vec)
  string_vec[to_process] <- vapply(string_vec[to_process], fun_0, character(1))
  paste(string_vec, collapse = '\n')
}
fun_2 <- function(text) {
    string_list <- strsplit(text, '\n')
    vapply(string_list, fun_1, character(1))
}

Examples

text <- c("1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe ([email protected])",
          "1110383 Project something 11/22/2019 ASP (890212-wso)\nJohn Doe ([email protected])\nOther Stuff",
          "1110383 Project something SD (890212)\nJohn Doe ([email protected])")
fun_2(text)
# [1] "1110383 Project something 11/22/2019 WSO (89021)\nJohn Doe ([email protected])"
# [2] "1110383 Project something 11/22/2019 ASP (890212)\nJohn Doe ([email protected])\nOther Stuff"
# [3] "1110383 Project something SD (890212)\nJohn Doe ([email protected])"
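The splitting step inside fun_0 may be easier to follow in isolation. The sketch below shows that splitting a line on either parenthesis yields alternating outside/inside pieces, so the bracket contents land at the even positions. The variable names (`line`, `parts`, `inside`, `leading`) and the sample string "(WSO) rest" are illustrative assumptions, not part of the answer.

```r
line <- "1110383 Project something 11/22/2019 (WSO) (89021-design)"

# Splitting on either parenthesis alternates outside/inside pieces
parts <- strsplit(line, "\\(|\\)", perl = TRUE)[[1L]]

# The bracket contents sit at the even positions
inside <- parts[seq(2L, length(parts), by = 2L)]
print(inside)

# A string that starts with "(" gets an empty first piece, so the
# bracket contents are still found at the even positions
leading <- strsplit("(WSO) rest", "\\(|\\)", perl = TRUE)[[1L]]
print(leading)
```

Note that strsplit() drops a trailing empty piece, which is why a line ending in ")" produces no extra element at the end.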

Upvotes: 0

Biblot

Reputation: 705

Is this what you want?

  1. "\\(([^0-9@]*)\\)": Remove the parentheses around anything that doesn't contain a digit or @
  2. "\\((\\d{5,6}).*\\)": For parentheses that start with 5 to 6 digits followed by anything else, keep only the digits.

I assumed the other set of parentheses would always contain email addresses.

library(stringr)

cat(
  paste0(
    str_replace(
      str_replace(text, "\\(([^0-9@]*)\\)", "\\1"), 
      "\\((\\d{5,6}).*\\)", 
      "(\\1)"), 
    collapse = "\n"
  )
)

# 1110383 Project something 11/22/2019 WSO (89021)
# John Doe ([email protected])
# 1110383 Project something 11/22/2019 ASP (890212)
# John Doe ([email protected])
# Other Stuff
# 1110383 Project something SD (890212)
# John Doe ([email protected])
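If stringr is not available, the same two-step idea can be sketched in base R with sub(), which also replaces only the first match in each string. This is a hedged variant, not the answer's own code: it assumes the `text` vector from the question (with `\n` line breaks), uses `[^)]*` instead of `.*` so the second match cannot run past the closing parenthesis, and re-wraps the digits in parentheses to match the output shown above. The names `step1` and `step2` are illustrative.

```r
text <- c(
  "1110383 Project something 11/22/2019 (WSO) (89021-design)\nJohn Doe ([email protected])",
  "1110383 Project something 11/22/2019 ASP (890212-wso)\nJohn Doe ([email protected])\nOther Stuff",
  "1110383 Project something SD (890212)\nJohn Doe ([email protected])"
)

# Step 1: drop the parentheses around content with no digits and no "@"
step1 <- sub("\\(([^0-9@]*)\\)", "\\1", text, perl = TRUE)

# Step 2: keep only the leading 5-6 digits, re-wrapped in parentheses;
# "[^)]*" (an assumption, used instead of ".*") stops at the closing parenthesis
step2 <- sub("\\((\\d{5,6})[^)]*\\)", "(\\1)", step1, perl = TRUE)

cat(step2, sep = "\n")
```

As with the stringr version, this relies on the remaining parentheses always holding an email address, so the "@" keeps them from being touched in step 1.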

Upvotes: 0
