Hong

Reputation: 594

How to remove all parentheses from a vector of strings except when the parentheses enclose the entire string?

I need to remove parentheses (together with their contents) from a vector of strings, but keep a pair of parentheses when it encloses the entire string. For example, I want to clean the following:

strvec <- c("Apple(Inc)", "(*1)Apple(Inc)", "((*1)Samsung(Inc))", "Samsung", "(Ford Co.(London))")

so that I get the following vector of strings:

c("Apple", "Apple", "(Samsung)", "Samsung", "(Ford Co.)")

The original data is a large vector (a column in a data frame with more than a million rows) with a variety of values. Any suggestions would be appreciated!

Following @rawr's comment for more examples:

strvec2 <- c("((Ford Co.(London)) subsidiary)", "Apple(Inc(*1))")

should be cleaned as

c("( subsidiary)", "Apple")

Upvotes: 0

Views: 104

Answers (1)

wp78de

Reputation: 18950

It requires a fairly complicated regex to meet all your requirements:

^\([^\(\)]+\)(?!$)|^\((?=.+\)$)(*SKIP)(*FAIL)|\((?:[^\(\)]|(?R))*\)


Using it with gsub:

gsub('^\\([^\\(\\)]+\\)(?!$)|^\\((?=.+\\)$)(*SKIP)(*FAIL)|\\((?:[^\\(\\)]|(?R))*\\)', '', strvec, perl = TRUE)
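As a quick check, here is a minimal sketch that runs the same pattern over both sample vectors from the question. The names `pattern` and `clean_parens` are introduced here only for illustration; they are not part of the original answer.

```r
# The pattern from above, stored once so it can be reused.
pattern <- '^\\([^\\(\\)]+\\)(?!$)|^\\((?=.+\\)$)(*SKIP)(*FAIL)|\\((?:[^\\(\\)]|(?R))*\\)'

# Hypothetical helper: a thin wrapper around the gsub() call.
# perl = TRUE is required for (?R), lookarounds and (*SKIP)(*FAIL).
clean_parens <- function(x) gsub(pattern, '', x, perl = TRUE)

strvec  <- c("Apple(Inc)", "(*1)Apple(Inc)", "((*1)Samsung(Inc))", "Samsung", "(Ford Co.(London))")
strvec2 <- c("((Ford Co.(London)) subsidiary)", "Apple(Inc(*1))")

clean_parens(strvec)   # compare with the first expected vector in the question
clean_parens(strvec2)  # compare with the second expected vector in the question
```

Since gsub is vectorized, the same helper can be applied directly to a data frame column, for example `df$name <- clean_parens(df$name)`, where `df` and `name` are placeholder names.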


The pattern has a lot of moving parts and can probably be optimized further, but it should do the trick.

Explanation

The first two alternatives handle the special case of keeping a pair of parentheses that encloses the entire string.

  • If a pair of parentheses encloses the string from start to end (the string starts with "(" and ends with ")"), we do not want to match the opening parenthesis, so we skip it: ^\((?=.+\)$)(*SKIP)(*FAIL)

  • But we do match a parenthesized group at the beginning of the string that closes before the end:
    ^\([^\(\)]+\)(?!$)

The remainder, \((?:[^\(\)]|(?R))*\), is a recursive pattern that matches balanced, possibly nested parentheses.
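To make the three alternatives easier to read, the pattern can be assembled from named pieces. This is only a sketch: the variable names `lead_group`, `keep_outer` and `nested` are made up for illustration. It also shows why the second alternative is needed, by comparing the recursive part on its own with the full pattern.

```r
# Alternative 1: a leading "(...)" group without nested parentheses
# that closes before the end of the string.
lead_group <- '^\\([^\\(\\)]+\\)(?!$)'

# Alternative 2: a leading "(" when the string also ends with ")";
# (*SKIP)(*FAIL) discards this position instead of matching it.
keep_outer <- '^\\((?=.+\\)$)(*SKIP)(*FAIL)'

# Alternative 3: any balanced group; (?R) recurses into the whole pattern.
nested <- '\\((?:[^\\(\\)]|(?R))*\\)'

# Joining the pieces with "|" gives the same string as the full pattern.
pattern <- paste(lead_group, keep_outer, nested, sep = '|')

# Recursive part alone: the outer pair is matched and removed as well.
only_nested <- gsub(nested, '', "(Ford Co.(London))", perl = TRUE)

# Full pattern: the outer pair survives, only the inner group is removed.
full <- gsub(pattern, '', "(Ford Co.(London))", perl = TRUE)

print(only_nested)  # ""
print(full)         # "(Ford Co.)"
```

Note that the second alternative only looks at the first and last characters of the string, so it treats any string that starts with "(" and ends with ")" as fully enclosed.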

Upvotes: 1
