regular expressions to remove repeated strings

Question

s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index." 

> unique(strsplit(s, ",")[[1]])
[1] "height (female)"         " weight"                 " BRCA1"                  " height (female)"        " body mass index"        " height (e.g. by kilos)"      " body mass index."

I have a string with the following structure: string, string, string, ..., string.

Each string is separated by a comma, except the last one, which is followed by a period. I want to remove the duplicated strings using regular expressions. Each string takes one of the following three formats:

word followed by (...), e.g. height (female) or height (e.g. by kilos)
a single word: e.g. weight or BRCA1
multiple words separated by a space, e.g. body mass index

My desired output is:

"height (female), weight, BRCA1, body mass index, height (e.g. by kilos)."

Simply calling strsplit on the comma doesn't handle two special cases: the space right before the second occurrence of height (female), and the period that follows the last body mass index.
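To make the two special cases concrete, here is a small base R check. It is only a diagnostic sketch; the variable names (parts, same_raw, same_trimmed, same_clean) are made up for illustration.

```r
s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."

parts <- strsplit(s, ",")[[1]]

# The 1st and 4th pieces differ only by a leading space.
same_raw     <- parts[1] == parts[4]
same_trimmed <- parts[1] == trimws(parts[4])

# The 7th and 12th pieces differ only by the final period.
same_clean <- parts[7] == sub("\\.$", "", parts[12])

print(c(same_raw, same_trimmed, same_clean))
```

So unique() treats these pieces as distinct only because of stray whitespace and the closing period, not because the underlying strings differ.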

Marius · Accepted Answer

As long as you don't have to escape any commas in the input, and the format is known (e.g. the string ends with a period that should be stripped), this can be done in a few simple steps:

library(stringr)

s_unique = s %>%
    str_remove("\\.$") %>%                # Strip the trailing period
    str_split(",", simplify = TRUE) %>%   # Split on commas
    str_trim() %>%                        # Trim whitespace
    unique()                              # Drop duplicates

paste0(s_unique, collapse = ", ")

Output:

[1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
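Note that this output drops the final period, while the desired output in the question keeps it. If the period is wanted, it can be appended after joining. Below is a rough base R sketch of the same idea with no stringr dependency, assuming (as above) that no item contains a comma; the names items, deduped and result are illustrative only.

```r
s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."

# Remove only the final period, then split on a comma plus any
# surrounding whitespace so no separate trimming step is needed.
items <- strsplit(sub("\\.$", "", s), "\\s*,\\s*")[[1]]

# unique() keeps the first occurrence of each item, preserving order.
deduped <- unique(items)

# Re-join with ", " and restore the trailing period.
result <- paste0(paste(deduped, collapse = ", "), ".")
print(result)
```

The pattern "\\.$" is anchored to the end of the string, so the periods inside "e.g." are left alone.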
