carozimm

Reputation: 109

Replace second occurrence of a string in one column based on value in other column in R

Here is a sample dataframe:

a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse")
df <- data.frame(a,b)

I'd like to be able to remove the second occurrence of the value in col a in col b.

Here is my desired output:

a      b
cat    my cat is a tabby and is a friendly cat
dog    walk the dog
mouse  the mouse is scared of the other

I've tried different combinations of gsub and some stringr functions, but I haven't even gotten close to being able to remove the second (and only the second) occurrence of the string in col a in col b. I think I'm asking something similar to this one, but I'm not familiar with Perl and couldn't translate it to R.

Thanks!

Upvotes: 1

Views: 1199

Answers (4)

hello_friend

Reputation: 5788

Base R, split-apply-combine solution:

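# Split-apply-combine: split by row, drop only the second word equal to a, recombine
data.frame(do.call("rbind", lapply(split(df, seq_len(nrow(df))), function(x){
        words <- unlist(strsplit(x$b, "\\s+"))
        # Positions of the words that exactly match the value in col a
        hits <- which(words == x$a)
        if (length(hits) >= 2) words <- words[-hits[2]]
        return(data.frame(a = x$a, b = paste(words, collapse = " ")))
      }
    )
  ),
  stringsAsFactors = FALSE, row.names = NULL
)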

Data:

df <- data.frame(a = c("cat", "dog", "mouse"),
                 b = c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse"), 
                 stringsAsFactors = FALSE)

Upvotes: 0

Andrew Gustar

Reputation: 18425

You could do this...

library(stringr)
df$b <- str_replace(df$b, 
                    paste0("(.*?",df$a,".*?) ",df$a), 
                    "\\1")

df
      a                                       b
1   cat my cat is a tabby and is a friendly cat
2   dog                            walk the dog
3 mouse        the mouse is scared of the other

The regex lazily matches everything up to and including the first occurrence of df$a, then as little text as possible until it reaches a space followed by a second df$a. The capture group (the part in parentheses) holds the text up to, but not including, that space. The whole match, including the second occurrence, is replaced by the capture group \\1, which has the effect of deleting the second df$a and its preceding space. Anything after the second df$a is not affected.
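One caveat with the pattern above is that df$a is pasted into the regex as-is, so a value containing regex metacharacters (for example a dot) would be interpreted as a pattern rather than as literal text. A minimal base R sketch of the same idea, assuming PCRE's \Q...\E quoting is acceptable and that no value in col a itself contains "\E"; the helper name remove_second is hypothetical and not part of the answer:

```r
df <- data.frame(
  a = c("cat", "dog", "mouse"),
  b = c("my cat is a tabby cat and is a friendly cat",
        "walk the dog",
        "the mouse is scared of the other mouse"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove the second occurrence of each needle
# (and any whitespace directly before it) from the matching haystack.
remove_second <- function(haystack, needle) {
  # \\Q...\\E quotes the needle so metacharacters in it are matched literally
  pattern <- paste0("^(.*?\\Q", needle, "\\E.*?)\\s*\\Q", needle, "\\E")
  # sub() accepts a single pattern, so pair each pattern with its own string
  mapply(function(p, s) sub(p, "\\1", s, perl = TRUE),
         pattern, haystack, USE.NAMES = FALSE)
}

df$b <- remove_second(df$b, df$a)
df
```

Rows with fewer than two occurrences are left unchanged because the pattern does not match them.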

Upvotes: 0

carozimm

Reputation: 109

I've actually found another solution that, though longer, may be clearer for other regex beginners:

library(stringr)
# Replace the first instance of col a in col b with the placeholder "INTERIM"
df$b <- str_replace(df$b, df$a, "INTERIM")

# The original second instance is now the first one left, so replacing the first instance of col a with an empty string removes it
df$b <- str_replace(df$b, df$a, "")

# Restore the placeholder "INTERIM" to the original string in col a
df$b <- str_replace(df$b, "INTERIM", df$a)

# Collapse the "double" whitespace left behind and trim the ends
df$b <- str_squish(df$b)


df
a            b
cat          my cat is a tabby and is a friendly cat
dog          walk the dog
mouse        the mouse is scared of the other

Upvotes: 1

G5W

Reputation: 37651

It takes a little work to build the right Regex.

P1 = paste(a, collapse="|")
PAT = paste0("((", P1, ").*?)(\\2)")

sub(PAT, "\\1", b, perl=TRUE)
[1] "my cat is a tabby  and is a friendly cat"
[2] "walk the dog"                            
[3] "the mouse is scared of the other "   
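The output above still carries the doubled space and the trailing space left behind by the removed word. A minimal sketch of one way to tidy that in base R, rebuilding the same pattern from the question's a and b; the names res and tidy are only illustrative:

```r
a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat",
       "walk the dog",
       "the mouse is scared of the other mouse")

# Same pattern as above: first matched word, lazy gap, then the same word again
PAT <- paste0("((", paste(a, collapse = "|"), ").*?)(\\2)")
res <- sub(PAT, "\\1", b, perl = TRUE)

# Collapse runs of whitespace to a single space, then strip both ends
tidy <- trimws(gsub("\\s+", " ", res))
tidy
```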

Upvotes: 1
