Reputation: 594
I need to remove parenthesized text from a vector of strings, but keep the outer parentheses when they enclose the entire string. For example, I want to clean the following:
strvec <- c("Apple(Inc)", "(*1)Apple(Inc)", "((*1)Samsung(Inc))", "Samsung", "(Ford Co.(London))")
so that I get the following vector of strings:
c("Apple", "Apple", "(Samsung)", "Samsung", "(Ford Co.)")
The original data is a large vector (a column in a data frame with more than a million rows) with a variety of values. Any suggestions would be appreciated!
Following @rawr's comment, here are more examples:
strvec2 <- c("((Ford Co.(London)) subsidiary)", "Apple(Inc(*1))")
should be cleaned as
c("( subsidiary)", "Apple")
Upvotes: 0
Views: 104
Reputation: 18950
It requires a fairly complicated regex to meet all your requirements:
^\([^\(\)]+\)(?!$)|^\((?=.+\)$)(*SKIP)(*FAIL)|\((?:[^\(\)]|(?R))*\)
Use it with gsub and perl = TRUE:
gsub('^\\([^\\(\\)]+\\)(?!$)|^\\((?=.+\\)$)(*SKIP)(*FAIL)|\\((?:[^\\(\\)]|(?R))*\\)', '', strvec, perl = TRUE)
There are a lot of moving parts here, and you can probably optimize it further, but it produces the requested output for the example vectors in the question.
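As a quick check, the pattern can be wrapped in a small helper and run against both example vectors from the question. This is only a sketch: the helper name clean_parens is made up for illustration, and the expected values in the comments are the ones stated in the question.

```r
# Hypothetical helper name; the pattern is the one from this answer,
# split into its three alternatives for readability.
clean_parens <- function(x) {
  pattern <- paste0(
    "^\\([^\\(\\)]+\\)(?!$)",         # leading group that closes before the end
    "|^\\((?=.+\\)$)(*SKIP)(*FAIL)",  # opening parenthesis of a string that ends in ")"
    "|\\((?:[^\\(\\)]|(?R))*\\)"      # balanced, possibly nested group
  )
  gsub(pattern, "", x, perl = TRUE)
}

strvec  <- c("Apple(Inc)", "(*1)Apple(Inc)", "((*1)Samsung(Inc))", "Samsung", "(Ford Co.(London))")
strvec2 <- c("((Ford Co.(London)) subsidiary)", "Apple(Inc(*1))")

clean_parens(strvec)
# "Apple" "Apple" "(Samsung)" "Samsung" "(Ford Co.)"
clean_parens(strvec2)
# "( subsidiary)" "Apple"
```

Since gsub is vectorized, the same call should work directly on a data frame column, for example df$name <- clean_parens(df$name), where df and name are placeholder names.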
Explanation
The first two alternatives handle the special case from the question: "leave the parenthesis if it encompasses the entire value of the string".
If the string starts with an opening parenthesis and ends with a closing one, we do not want to match that opening parenthesis, so the engine is told to skip it and fail at that position: ^\((?=.+\)$)(*SKIP)(*FAIL)
But we do want to match a parenthesized group at the beginning that is closed before the end of the string:
^\([^\(\)]+\)(?!$)
The remaining alternative is a recursive pattern that matches balanced, possibly nested parentheses.
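To see why the two special-case alternatives are needed, it may help to run the recursive alternative on its own and compare it with the full pattern. The sketch below does that, and also writes the full pattern in PCRE free-spacing mode ((?x)) so that each alternative can carry a comment. The variable names are illustrative, and the free-spacing version is meant to behave the same as the one-line pattern above.

```r
strvec <- c("Apple(Inc)", "(*1)Apple(Inc)", "((*1)Samsung(Inc))", "Samsung", "(Ford Co.(London))")

# The recursive alternative alone removes every balanced group,
# including a pair that wraps the whole string.
recursive_only <- "\\((?:[^()]|(?R))*\\)"
res_recursive <- gsub(recursive_only, "", strvec, perl = TRUE)
res_recursive
# "Apple" "Apple" "" "Samsung" ""

# Full pattern in free-spacing mode: unescaped whitespace is ignored
# and "#" starts a comment that runs to the end of the line.
annotated <- paste(
  "(?x)",
  "  ^\\( [^()]+ \\) (?!$)          # leading group closed before the end of the string",
  "| ^\\( (?=.+\\)$) (*SKIP)(*FAIL)  # outer opening parenthesis: skip it, do not remove",
  "| \\( (?: [^()] | (?R) )* \\)     # any other balanced group, nested ones included",
  sep = "\n"
)
res_full <- gsub(annotated, "", strvec, perl = TRUE)
res_full
# "Apple" "Apple" "(Samsung)" "Samsung" "(Ford Co.)"
```

With only the recursive alternative, the third and fifth strings are wiped out entirely, which is exactly the case the first two alternatives protect against.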
Upvotes: 1