Reputation: 1140
I can't find a way to do this...
raw_string <- "\"+001\", la bonne surprise de M. Jenn M. Ayache via @MYTF1News"
clean_string <- "+001, la bonne surprise de Jenn Ayache"
desired_string <- "\"\"M. M. via @MYTF1News"
I am not sure what to call this transformation. I would say "difference" (as in set theory, as opposed to "union" and "intersection"). A better name could be "relative complement".
My desired string contains all and only the characters missing from clean_string, in their original order, once for each time they appear, including spaces, punctuation and everything else.
The best I managed to do isn't good enough:
> a <- paste(Reduce(setdiff, strsplit(c(raw_string, clean_string), split = " ")), collapse = " ")
> a
[1] "\"+001\", M. via @MYTF1News"
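A small sketch of why the word-level attempt falls short, using toy vectors that are illustrative only and not part of the original data: setdiff() treats its arguments as sets, so it drops duplicates and ignores position, and it compares whole tokens rather than characters.

```r
# setdiff() returns the unique elements of x that are not in y,
# so the repeated "M." survives only once.
words_raw   <- c("M.", "Jenn", "M.", "Ayache")
words_clean <- c("Jenn", "Ayache")
word_diff <- setdiff(words_raw, words_clean)
print(word_diff)

# Whole tokens are compared, so a token that merely contains extra
# characters is kept in full instead of being reduced to the extras.
token_diff <- setdiff(c("\"+001\",", "la"), c("+001,", "la"))
print(token_diff)
```

This suggests that a character-level, order-aware comparison is needed rather than a set operation on words.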
Upvotes: 6
Views: 579
Reputation: 19960
Here is a slightly more concise way using sub(), which requires you to escape punctuation so that it is not treated as a regex metacharacter.
str_relative_complement <- function(raw_string, clean_string){
  # Split the clean string into single characters
  words <- strsplit(clean_string, "")[[1]]
  cur_str <- raw_string
  for(i in words){
    # Remove the first remaining occurrence of each character,
    # escaping punctuation so it is matched literally
    cur_str <- sub(ifelse(grepl("[[:punct:]]", i), paste0("\\", i), i), "", cur_str)
  }
  cur_str
}
raw_string <- '\"+001\", la bonne surprise de M. Jenn M. Ayache via @MYTF1News'
clean_string <- "+001, la bonne surprise de Jenn Ayache"
str_relative_complement(raw_string, clean_string)
[1] "\"\"M. M.  via @MYTF1News"
Upvotes: 1
Reputation: 132706
I would use a loop, too:
x <- strsplit(raw_string, "")[[1]]
y <- strsplit(clean_string, "")[[1]]
res <- character(length(x))
j <- 1
for(i in seq_along(x)) {
  if (j > length(y)) {
    # clean_string is exhausted: keep everything that is left
    res[i:length(x)] <- x[i:length(x)]
    break
  }
  if (x[i] != y[j]) {
    res[i] <- x[i]
  } else {
    j <- j + 1
  }
}
paste(res, collapse = "")
#[1] "\"\"M. M.  via @MYTF1News"
Note the extra space in comparison to your expected result. I think you simply missed it.
If this is too slow, it should be easy to implement with Rcpp.
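As a rough sketch, the same single-pass scan can be wrapped in a reusable function. The name str_leftover is hypothetical, and the sketch assumes the characters of clean_string appear in raw_string in the same order; it stays in plain R rather than Rcpp.

```r
# Hypothetical helper: walk raw_string once and keep every character
# that is not consumed by the next expected character of clean_string.
str_leftover <- function(raw_string, clean_string) {
  x <- strsplit(raw_string, "")[[1]]
  y <- strsplit(clean_string, "")[[1]]
  keep <- logical(length(x))  # TRUE for characters of x left unmatched
  j <- 1
  for (i in seq_along(x)) {
    if (j <= length(y) && x[i] == y[j]) {
      j <- j + 1        # matched: consume one character of clean_string
    } else {
      keep[i] <- TRUE   # unmatched, or clean_string exhausted: keep it
    }
  }
  paste(x[keep], collapse = "")
}

raw_string <- "\"+001\", la bonne surprise de M. Jenn M. Ayache via @MYTF1News"
clean_string <- "+001, la bonne surprise de Jenn Ayache"
leftover <- str_leftover(raw_string, clean_string)
print(leftover)
```

For the example strings this returns the same result as the loop, including the two consecutive spaces before "via".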
Upvotes: 3
Reputation: 14346
I don't know whether one of the string manipulation packages already implements a function for this (I haven't come across one). Here is an implementation which (I think) works:
raw_string <- "\"+001\", la bonne surprise de M. Jenn M. Ayache via @MYTF1News"
clean_string <- "+001, la bonne surprise de Jenn Ayache"
raw <- strsplit(raw_string, "")[[1]]
clean <- strsplit(clean_string, "")[[1]]
dif <- vector("list")
j <- 1
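while(length(clean) > 0) {
  # Position of the next clean character in what remains of raw
  i <- match(clean[1], raw)
  if (i > 1) {
    # Everything before the match is missing from clean
    dif[[j]] <- raw[seq_len(i - 1)]
    j <- j + 1
  }
  clean <- clean[-1]
  raw <- raw[-seq_len(i)]
}
# Whatever is left of raw after clean is exhausted
dif[[j]] <- raw
paste(unlist(dif), collapse = "")
#[1] "\"\"M. M.  via @MYTF1News"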
Upvotes: 1