Reputation: 107
I have a list of strings like this:
string <- c("tasty apple", "tasty orange", "yellow banana", "red tasty peach", "tasty banana apple", "tasty apple yellow banana", "yellow orange banana", "peach tasty apple", "yellow banana tasty peach")
A string with just one type of fruit is fine as it is. However, when a string contains two of them, I want to keep only one, according to this list of patterns and replacements:
pattern <- c("banana apple", "banana orange", "peach apple", "banana peach")
replacement <- c("apple", "banana", "peach", "banana")
I can remove one of the fruits when they are next to each other in the string. However, in my data there can be other words between them, and I do not know how to remove the unwanted word in that case. The order of the words in the string might differ as well.
I want it to be like this:
Before | After |
---|---|
tasty apple | tasty apple |
tasty orange | tasty orange |
yellow banana | yellow banana |
red tasty peach | red tasty peach |
tasty banana apple | tasty apple |
tasty apple yellow banana | tasty apple yellow |
yellow orange banana | yellow banana |
peach tasty apple | peach tasty |
yellow banana tasty peach | yellow banana tasty |
Upvotes: 1
Views: 69
Reputation: 161085
We can `Reduce` this to iteratively use `gsub` to replace a `pattern=` with a `replacement=`. Since you want both orders of fruits, I'm adding a few to your patterns so that we can get both. (If you have many more combinations, this can be automated using permutations.)
pattern <- c("banana(.+)apple", "banana(.+)orange", "peach(.+)apple", "banana(.+)peach", "apple(.+)banana", "orange(.+)banana")
replacement <- c("apple\\1", "banana\\1", "peach\\1", "banana\\1", "apple\\1", "banana\\1")
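The reversed patterns can also be generated rather than typed by hand. A minimal sketch, assuming every original pattern consists of exactly two space-separated words; the names `make_both_orders`, `pattern_both`, and `replacement_both` are made up for illustration:

```r
# Original pairs from the question
pattern <- c("banana apple", "banana orange", "peach apple", "banana peach")
replacement <- c("apple", "banana", "peach", "banana")

# Hypothetical helper: turn "a b" into the regexes "a(.+)b" and "b(.+)a"
make_both_orders <- function(p) {
  w <- strsplit(p, " ", fixed = TRUE)[[1]]
  c(paste0(w[1], "(.+)", w[2]), paste0(w[2], "(.+)", w[1]))
}

pattern_both <- unlist(lapply(pattern, make_both_orders))
# Each original replacement serves both orders of its pair
replacement_both <- paste0(rep(replacement, each = 2), "\\1")
```

This yields eight patterns, including `apple(.+)peach` and `peach(.+)banana`, which the hand-written list leaves out.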
Since `Reduce` iterates over a single vector or list, and we want to apply each pattern together with its replacement, I'll convert those two into a single list of pairs:
ptn_repl <- Map(c, pattern, replacement)
ptn_repl
# $`banana(.+)apple`
# [1] "banana(.+)apple" "apple\\1"
# $`banana(.+)orange`
# [1] "banana(.+)orange" "banana\\1"
# $`peach(.+)apple`
# [1] "peach(.+)apple" "peach\\1"
# $`banana(.+)peach`
# [1] "banana(.+)peach" "banana\\1"
# $`apple(.+)banana`
# [1] "apple(.+)banana" "apple\\1"
# $`orange(.+)banana`
# [1] "orange(.+)banana" "banana\\1"
From here, the reduction is straightforward:
trimws(Reduce(function(prev, this) gsub(this[[1]], this[[2]], prev), ptn_repl, init = string))
# [1] "tasty apple" "tasty orange" "yellow banana" "red tasty peach" "tasty apple"
# [6] "tasty apple yellow" "yellow banana" "peach tasty" "yellow banana tasty"
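For readers less familiar with `Reduce`, the same reduction can be written as an explicit loop. This is a sketch that repeats the definitions so it runs on its own; `out` is just an illustrative name:

```r
string <- c("tasty apple", "tasty orange", "yellow banana", "red tasty peach",
            "tasty banana apple", "tasty apple yellow banana",
            "yellow orange banana", "peach tasty apple", "yellow banana tasty peach")
pattern <- c("banana(.+)apple", "banana(.+)orange", "peach(.+)apple",
             "banana(.+)peach", "apple(.+)banana", "orange(.+)banana")
replacement <- c("apple\\1", "banana\\1", "peach\\1", "banana\\1", "apple\\1", "banana\\1")

out <- string
for (i in seq_along(pattern)) {
  # Each pass rewrites the result of the previous pass, as Reduce does
  out <- gsub(pattern[i], replacement[i], out)
}
out <- trimws(out)
```

Because every pass works on the output of the one before it, the order of the patterns can matter once a string matches more than one of them.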
I use `trimws` because the `(.+)` pattern captures and retains the space.
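A quick check on a single string makes the leftover space visible. This is only an illustration; `raw` is a name chosen here:

```r
# (.+) captures the single space between "banana" and "apple",
# and \\1 puts it back after the kept word
raw <- gsub("banana(.+)apple", "apple\\1", "tasty banana apple")
nchar(raw)                   # one more character than nchar("tasty apple")
cleaned <- trimws(raw)       # drops the trailing space
```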
The regex components do two things: `(.+)` matches one or more of any character, including blank spaces; and `\\1` inserts whatever `(.+)` captured, with the effect that the other components in `pattern` are dropped, and the literal component in `replacement` (which is here always one of the literals in `pattern`) is prepended to the captured text.
Upvotes: 0
Reputation: 1303
Here is a simple solution using a nested for loop, working with the `pattern` and `replacement` vectors from the question. The idea is to (1) invert the replacement, so that it gives the word to delete rather than the word to keep, (2) detect the strings that contain every word of a pattern, and (3) delete the word found in (1):
# For each pattern, strip the word to keep; what remains is the word to delete
reverse_replacement <- unlist(lapply(seq_along(pattern), function(x) {
  stringr::str_trim(stringr::str_remove(pattern[x], replacement[x]), "both")
}))
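index <- 0
for (word_combi in string) {
  index <- index + 1
  index_pattern <- 0
  for (pat in pattern) {
    index_pattern <- index_pattern + 1
    # Split the pattern into its individual words
    words_pattern <- stringr::str_split(pat, " ", n = Inf, simplify = FALSE)[[1]]
    # Check whether each pattern word occurs somewhere in the current string
    words <- stringr::str_detect(word_combi, words_pattern)
    if (all(words)) {
      # Delete the unwanted word; str_squish trims the ends and also
      # collapses the double space left when the word sat mid-string
      string[index] <- stringr::str_squish(stringr::str_remove(word_combi, reverse_replacement[index_pattern]))
    }
  }
}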
string
[1] "tasty apple" "tasty orange" "yellow banana" "red tasty peach"
[5] "tasty apple" "tasty apple yellow" "yellow banana" "peach tasty"
[9] "yellow banana tasty"
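The same idea can be written without explicit index counters. The following is a base-R sketch, not the code above: the helper name `drop_second_fruit` is invented here, it matches whole words only (via `\\b`), and it assumes each pattern is two space-separated words of which exactly one equals the replacement:

```r
string <- c("tasty apple", "tasty orange", "yellow banana", "red tasty peach",
            "tasty banana apple", "tasty apple yellow banana",
            "yellow orange banana", "peach tasty apple", "yellow banana tasty peach")
pattern <- c("banana apple", "banana orange", "peach apple", "banana peach")
replacement <- c("apple", "banana", "peach", "banana")

# Hypothetical helper, vectorised over all strings for each pattern
drop_second_fruit <- function(string, pattern, replacement) {
  for (i in seq_along(pattern)) {
    words <- strsplit(pattern[i], " ", fixed = TRUE)[[1]]
    drop <- setdiff(words, replacement[i])  # the word to delete
    # TRUE where every word of the pattern occurs as a whole word
    hit <- Reduce(`&`, lapply(words, function(w) {
      grepl(paste0("\\b", w, "\\b"), string, perl = TRUE)
    }))
    string[hit] <- sub(paste0("\\b", drop, "\\b"), "", string[hit], perl = TRUE)
  }
  # Collapse doubled spaces left behind and trim the ends
  trimws(gsub(" +", " ", string))
}

result <- drop_second_fruit(string, pattern, replacement)
```

Since only the presence of both words is tested, the order of the fruits in the string does not matter and no reversed patterns are needed.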
Upvotes: 0