Polina Ermolaeva

Reputation: 107

Remove one word if it appears in a string together with another

I have a list of strings like this:

string <- c("tasty apple", "tasty orange", "yellow banana", "red tasty peach", "tasty banana apple", "tasty apple yellow banana", "yellow orange banana", "peach tasty apple", "yellow banana tasty peach")

When there is just one type of fruit in the string, it is fine. However, when there are two or more of them, I have a list of patterns and replacements that say which fruit to keep:

pattern <- c("banana apple", "banana orange", "peach apple", "banana peach")
replacement <- c("apple", "banana", "peach", "banana")

I can remove one of the fruits when they are next to each other in the string. However, in my data there can be other words between them, and I do not know how to remove the unwanted word in that case. The order of the words in the string might differ as well.

I want it to be like this:

Before                    | After
tasty apple               | tasty apple
tasty orange              | tasty orange
yellow banana             | yellow banana
red tasty peach           | red tasty peach
tasty banana apple        | tasty apple
tasty apple yellow banana | tasty apple yellow
yellow orange banana      | yellow banana
peach tasty apple         | peach tasty
yellow banana tasty peach | yellow banana tasty

Upvotes: 1

Views: 69

Answers (2)

r2evans

Reputation: 161085

We can use Reduce to apply gsub iteratively, replacing each pattern= with its replacement=. Since the fruits can appear in either order, I'm adding reversed versions of a few of your patterns so that both orders are caught. (If you have many more combinations, this can be automated using permutations.)

pattern <- c("banana(.+)apple", "banana(.+)orange", "peach(.+)apple", "banana(.+)peach", "apple(.+)banana", "orange(.+)banana")
replacement <- c("apple\\1", "banana\\1", "peach\\1", "banana\\1", "apple\\1", "banana\\1")
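The hand-written additions above could be generated instead. The following is a minimal sketch, assuming every entry in the question's original pattern vector is exactly two words separated by a single space. The names pairs, keep, pattern_both and replacement_both are hypothetical and not part of the original answer. Note that it produces both orders for every pair, so it also yields "apple(.+)peach" and "peach(.+)banana", which the list above does not include.

```r
# Pairs and the fruit to keep, copied from the question
pairs <- c("banana apple", "banana orange", "peach apple", "banana peach")
keep  <- c("apple", "banana", "peach", "banana")

# Split each pair into its two words
words <- strsplit(pairs, " ", fixed = TRUE)

# For every pair, build "first(.+)second" and "second(.+)first"
pattern_both <- unlist(lapply(words, function(w) {
  c(paste0(w[1], "(.+)", w[2]), paste0(w[2], "(.+)", w[1]))
}))

# Each kept fruit is needed twice (once per order), followed by the captured text
replacement_both <- paste0(rep(keep, each = 2), "\\1")
```

These two vectors could then be passed through the same Map and Reduce steps in place of pattern and replacement.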

Since Reduce iterates over a single list, and we want to pair each pattern with its replacement, I'll combine the two into a single list of pairs:

ptn_repl <- Map(c, pattern, replacement)
ptn_repl
# $`banana(.+)apple`
# [1] "banana(.+)apple" "apple\\1"       
# $`banana(.+)orange`
# [1] "banana(.+)orange" "banana\\1"       
# $`peach(.+)apple`
# [1] "peach(.+)apple" "peach\\1"      
# $`banana(.+)peach`
# [1] "banana(.+)peach" "banana\\1"      
# $`apple(.+)banana`
# [1] "apple(.+)banana" "apple\\1"       
# $`orange(.+)banana`
# [1] "orange(.+)banana" "banana\\1"       

From here, the reduction is straightforward:

trimws(Reduce(function(prev, this) gsub(this[[1]], this[[2]], prev), ptn_repl, init = string))
# [1] "tasty apple"         "tasty orange"        "yellow banana"       "red tasty peach"     "tasty apple"        
# [6] "tasty apple yellow"  "yellow banana"       "peach tasty"         "yellow banana tasty"

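I use trimws because (.+) also captures the spaces around the intervening words, which leaves a trailing space whenever the removed fruit was the last word in the string.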

The regex components do two things:

  • (.+) matches one or more of any character, including blank spaces, and captures what it matched; and
  • \\1 inserts the text captured by (.+) into the replacement. The effect is that the fruit names matched by the pattern are dropped, and the literal component in replacement (which is here always one of the literals in pattern) is placed in front of the captured text.
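To make the two components concrete, here is a small sketch on a single string from the question. The variable names x, captured and replaced are only for illustration.

```r
x <- "tasty apple yellow banana"

# What (.+) captures between the two fruits: the middle word plus both spaces
captured <- sub(".*apple(.+)banana.*", "\\1", x)
captured
# [1] " yellow "

# "apple yellow banana" is replaced by "apple" followed by the captured text
replaced <- gsub("apple(.+)banana", "apple\\1", x)
replaced
# [1] "tasty apple yellow "

# trimws drops the trailing space left behind by the capture
trimws(replaced)
# [1] "tasty apple yellow"
```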

Upvotes: 0

JKupzig

Reputation: 1303

Here is a simple solution using a nested for loop. The idea is to (1) invert each replacement so that it names the word to delete rather than the word to keep, (2) detect the strings that contain every word of a pattern, and (3) delete the word found in (1) from those strings:

# pattern and replacement are the vectors defined in the question.
# (1) The word to delete is what remains of each pattern once the kept word is removed
reverse_replacement <- stringr::str_trim(stringr::str_remove(pattern, replacement), "both")

for (i in seq_along(string)) {
  for (j in seq_along(pattern)) {
    # (2) Split the pattern into words and check that all of them occur in the string
    words_pattern <- stringr::str_split(pattern[j], " ")[[1]]
    if (all(stringr::str_detect(string[i], words_pattern))) {
      # (3) Delete the unwanted word; str_squish also collapses the double space it leaves
      string[i] <- stringr::str_squish(stringr::str_remove(string[i], reverse_replacement[j]))
    }
  }
}

string
[1] "tasty apple"         "tasty orange"        "yellow banana"       "red tasty peach"
[5] "tasty apple"         "tasty apple yellow"  "yellow banana"       "peach tasty"
[9] "yellow banana tasty"
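One thing to be aware of: detecting a word with a plain pattern also matches substrings, so "apple" would be found inside "pineapple". If whole-word matching is wanted, a base R sketch could look like the following. The helper name has_word is hypothetical, and the sketch assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: TRUE only if `word` occurs as a whole word in `x`
has_word <- function(x, word) {
  grepl(paste0("\\b", word, "\\b"), x, perl = TRUE)
}

grepl("apple", "tasty pineapple")               # TRUE: plain substring match
has_word("tasty pineapple", "apple")            # FALSE: not a whole word
has_word("tasty apple yellow banana", "apple")  # TRUE
```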

Upvotes: 0
