Reputation: 335
I have a data frame that looks like this:
c1 c2
fish fishing
dog tomato
cat loop
horse horse
I would now like to delete every row where c1 == c2, or where c1 is a substring of c2 or vice versa. In my example, horse == horse and 'fish' is a substring of 'fishing'. I know about the grepl function, e.g.: df[!grepl(df$c1, df$c2),].
However, this solution does not account for substrings. Maybe there is a solution where I can use df[!grepl("STRING", df$c2),] for every row, so that "STRING" equals the value of df$c1?
Thanks in advance!
Upvotes: 2
Views: 126
Reputation: 160417
dat[!with(dat, mapply(grepl, c1, c2)) & !with(dat, mapply(grepl, c2, c1)),]
# c1 c2
# 2 dog tomato
# 3 cat loop
grepl only works on one pattern at a time: if you try multiple patterns (i.e., each of dat$c1), then you'll receive a warning (and not the intended output).
grepl(dat$c1, dat$c2)
# Warning in grepl(dat$c1, dat$c2) :
# argument 'pattern' has length > 1 and only the first element will be used
# [1] TRUE FALSE FALSE FALSE
We vectorize it (with mapply) and run it iteratively on each of the c1/c2 pairs.
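To make the pairwise behaviour concrete, here is a minimal sketch that rebuilds the example data (assuming character columns) and compares mapply with an explicit loop over the rows. The names pairwise and looped are only illustrative.

```r
# Example data from the question (character columns assumed)
dat <- data.frame(
  c1 = c("fish", "dog", "cat", "horse"),
  c2 = c("fishing", "tomato", "loop", "horse"),
  stringsAsFactors = FALSE
)

# mapply calls grepl(c1[i], c2[i]) once per row i
pairwise <- mapply(grepl, dat$c1, dat$c2)

# The same check written as an explicit loop over row indices
looped <- vapply(
  seq_len(nrow(dat)),
  function(i) grepl(dat$c1[i], dat$c2[i]),
  logical(1)
)

# mapply names its result after c1, so drop the names before comparing
same <- identical(unname(pairwise), looped)
```

Both versions test each c1 value only against the c2 value in the same row, which is what a single grepl(dat$c1, dat$c2) call does not do.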
Upvotes: 2
Reputation: 3294
Using tidyverse:
library(tidyverse)
df %>%
filter(!str_detect(c2, c1), !str_detect(c1, c2))
Output:
c1 c2
1: dog tomato
2: cat loop
Because both directions are checked, this will work no matter which column holds the longer string (not just in your specific example).
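One caveat worth sketching: str_detect treats the pattern as a regular expression by default, so values containing characters such as "." can match more than a literal substring would. Wrapping the pattern in stringr's fixed() makes the comparison literal. The data below is hypothetical and only there to show the difference.

```r
library(dplyr)
library(stringr)

# Hypothetical data: "a.c" contains the regex metacharacter "."
df <- data.frame(
  c1 = c("fish", "a.c", "dog"),
  c2 = c("fishing", "abc", "tomato"),
  stringsAsFactors = FALSE
)

# Regex matching: "." matches any character, so "a.c" is found in "abc"
regex_result <- df %>%
  filter(!str_detect(c2, c1), !str_detect(c1, c2))

# Literal matching: fixed() compares the strings character by character
literal_result <- df %>%
  filter(!str_detect(c2, fixed(c1)), !str_detect(c1, fixed(c2)))
```

With the regex version only the dog/tomato row survives, while the literal version also keeps the a.c/abc row, since "a.c" is not a literal substring of "abc".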
Upvotes: 2