Reputation: 9793
s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."
> unique(strsplit(s, ",")[[1]])
[1] "height (female)" " weight" " BRCA1" " height (female)" " body mass index" " height (e.g. by kilos)" " body mass index."
I have a string that has the following structure: <string>, <string>, <string>, ..., <string>.
Each <string> is separated by a comma, except the last one, which is followed by a period. I want to remove the duplicated strings using regular expressions. A string can take on one of the following three formats:
a word followed by text in parentheses, e.g. height (female) or height (e.g. by kilos)
a single word, e.g. weight or BRCA1
several words, e.g. body mass index
My desired output is:
"height (female), weight, BRCA1, body mass index, height (e.g. by kilos)."
Simply doing a strsplit on the comma doesn't account for the special cases where there is a space right before the second occurrence of height (female) or where the last body mass index is followed by a period.
Upvotes: 2
Views: 109
Reputation: 7600
@thelatemail's comment has you pointed in the right direction. Use unlist(strsplit(x = <input string>, split = <regex pattern>)) to split on each comma plus the whitespace after it, and to drop the final period. unique removes the duplicates, and paste(<character vector>, collapse = ", ") puts everything back together. Don't forget to unlist: strsplit returns a list, so without it unique compares the elements of the list rather than the elements of the character vector.
# input
s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."
# code
paste(unique(unlist(strsplit(s, ",\\s+|\\.$"))), collapse = ", ")
# [1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
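The one-liner above can be unpacked into separate steps. The sketch below is one way to do that; it also appends the trailing period that the question's desired output ends with. The helper name dedupe_items is hypothetical and not part of the answer.

```r
# Hypothetical helper (not from the original answer): de-duplicate a
# comma-separated list whose last item is followed by a period.
dedupe_items <- function(s) {
  # Split on a comma followed by whitespace, or on the final period
  parts <- unlist(strsplit(s, ",\\s+|\\.$"))
  # Keep only the first occurrence of each item, preserving order
  parts <- unique(parts)
  # Rejoin the items and restore the trailing period
  paste0(paste(parts, collapse = ", "), ".")
}

s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."
dedupe_items(s)
# [1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)."
```

The periods inside "e.g." are left alone because \\.$ only matches a period at the very end of the string.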
Upvotes: 3
Reputation: 60060
As long as you don't have to escape any commas in the input, and the format is known (e.g. the string ends with a period that should be stripped), this can be done in a few simple steps:
library(stringr)
s_unique = s %>%
  str_remove("\\.$") %>%                # Drop the final period
  str_split(",", simplify = TRUE) %>%   # Split on commas
  str_trim() %>%                        # Trim whitespace
  unique()
paste0(s_unique, collapse = ", ")
Output:
[1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
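For comparison, the same steps can be written without stringr. This is a minimal base R sketch, not part of the answer, and it makes the same assumptions: no commas inside an item, and a single trailing period that should be stripped.

```r
s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."

# Drop the final period (periods inside "e.g." are untouched)
no_period <- sub("\\.$", "", s)
# Split on literal commas; strsplit returns a list, so take its first element
items <- strsplit(no_period, ",", fixed = TRUE)[[1]]
# Trim the whitespace left around each item
items <- trimws(items)
# Remove duplicates and join the items back together
result <- paste(unique(items), collapse = ", ")
result
# [1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
```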
Upvotes: 2