Adrian

Reputation: 9793

regular expressions to remove repeated strings

s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index." 

> unique(strsplit(s, ",")[[1]])
[1] "height (female)"         " weight"                 " BRCA1"                  " height (female)"        " body mass index"        " height (e.g. by kilos)"      " body mass index." 

I have a string that has the following structure: <string>, <string>, <string>, ..., <string>.

Each <string> is separated by a comma, except the last one, which is followed by a period. I want to remove the duplicated strings using regular expressions. A string can take on one of the following three formats:

  1. word followed by (...), e.g. height (female) or height (e.g. by kilos)
  2. a single word: e.g. weight or BRCA1
  3. multiple words separated by a space, e.g. body mass index

My desired output is:

"height (female), weight, BRCA1, body mass index, height (e.g. by kilos)."

Simply calling strsplit on the comma doesn't handle two special cases: a repeated item that is preceded by a space (such as the second occurrence of height (female)), and the last item (body mass index), which is followed by a period.

Upvotes: 2

Views: 109

Answers (2)

De Novo

Reputation: 7600

@thelatemail's comment points you in the right direction. Use unlist(strsplit(x = <input string>, split = <regex pattern>)) to split on each comma and the whitespace after it, and to drop the trailing period. unique then removes the duplicates, and paste(<character vector>, collapse = ", ") puts everything back together. Don't forget to unlist: strsplit returns a list, so without it unique compares the elements of that list (here, one whole character vector) rather than the individual strings.

# input
s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index." 

# code
paste(unique(unlist(strsplit(s, ",\\s+|\\.$"))), collapse = ", ")
# [1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
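Note that the result above has no trailing period, while the desired output in the question ends with one. A minimal sketch of one way to put it back, assuming every input ends with a single period; the helper name dedupe_terms is hypothetical and not part of the original answer:

```r
# Hypothetical helper (name is an assumption, not from the answer above).
dedupe_terms <- function(x) {
  # Split on comma + whitespace and drop the final period, as in the answer.
  parts <- unlist(strsplit(x, ",\\s+|\\.$"))
  # Rejoin the unique terms, then append the period that the split removed.
  paste0(paste(unique(parts), collapse = ", "), ".")
}

s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."
result <- dedupe_terms(s)
print(result)
```

The period inside "e.g." is left alone because the pattern only matches a period at the very end of the string.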

Upvotes: 3

Marius

Reputation: 60060

As long as you don't have to escape any commas in the input, and the format is known (e.g. the string ends with a period that should be stripped), this can be done in a few simple steps:

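library(stringr)  # also provides the %>% pipe used below
# s is the input string defined in the question
s_unique = s %>%
    str_remove("\\.$") %>%                   # Strip the trailing period
    str_split(",", simplify = TRUE) %>%      # Split on commas
    str_trim() %>%  # Trim whitespace
    unique()                                 # Drop duplicates

paste0(s_unique, collapse = ", ")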

Output:

[1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
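If you would rather not load stringr, the same steps can be sketched in base R. This is only a rough equivalent under the same assumptions (no commas inside an item, one trailing period); the intermediate names no_period and parts are mine:

```r
s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."

# Strip the trailing period.
no_period <- sub("\\.$", "", s)
# Split on literal commas and trim the whitespace around each piece.
parts <- trimws(strsplit(no_period, ",", fixed = TRUE)[[1]])
# Drop duplicates (first occurrence wins) and rejoin.
s_unique <- unique(parts)
result <- paste(s_unique, collapse = ", ")
print(result)
```

Here [[1]] plays the role of unlist, since strsplit returns a list with one element per input string.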

Upvotes: 2
