diggi2395

Reputation: 335

R: Delete rows where one column is a substring of another

I have a data frame that looks like this:

c1      c2
fish    fishing
dog     tomato
cat     loop
horse   horse

I would like to delete every row where c1 == c2, as well as every row where c1 is a substring of c2 or vice versa. In my example, horse == horse and 'fish' is a substring of 'fishing', so the first and last rows should be removed. I know about the grepl function, e.g.: df[!grepl(df$c1, df$c2),].

However, this does not check the substring condition row by row. Maybe there is a solution where I can use df[!grepl("STRING", df$c2),] for every row, so that "STRING" equals that row's value of df$c1?

Thanks in advance!

Upvotes: 2

Views: 126

Answers (2)

r2evans

Reputation: 160417

base R

dat[!with(dat, mapply(grepl, c1, c2)) & !with(dat, mapply(grepl, c2, c1)),]
#    c1     c2
# 2 dog tomato
# 3 cat   loop

grepl only works on one pattern at a time: if you pass it multiple patterns (i.e., all of dat$c1), it uses only the first one ("fish" here), gives a warning, and does not return the intended output.

grepl(dat$c1, dat$c2)
# Warning in grepl(dat$c1, dat$c2) :
#   argument 'pattern' has length > 1 and only the first element will be used
# [1]  TRUE FALSE FALSE FALSE

We vectorize it with mapply, which calls grepl once for each c1/c2 pair. The check is done in both directions (c1 in c2, and c2 in c1), which also covers the rows where c1 == c2.
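
One caveat with the mapply approach: grepl treats its pattern as a regular expression, so a value such as "a.c" would also match "abc". If the columns should be compared as literal text, passing fixed = TRUE is one option. The following is a minimal sketch under that assumption; the helper name is_sub is made up for illustration, and dat is rebuilt from the question's example.

```r
# Example data from the question
dat <- data.frame(
  c1 = c("fish", "dog", "cat", "horse"),
  c2 = c("fishing", "tomato", "loop", "horse"),
  stringsAsFactors = FALSE
)

# is_sub() is a hypothetical helper: for each pair, TRUE when `pattern`
# occurs literally (no regex interpretation) inside `text`
is_sub <- function(pattern, text) {
  unname(mapply(grepl, pattern, text, MoreArgs = list(fixed = TRUE)))
}

# Keep a row only if neither column contains the other
keep <- !is_sub(dat$c1, dat$c2) & !is_sub(dat$c2, dat$c1)
result <- dat[keep, ]
```

With the example data, keep is FALSE for the fish/fishing and horse/horse rows, so only rows 2 and 3 remain.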

Upvotes: 2

bird

Reputation: 3294

Using tidyverse:

library(tidyverse)
df %>%
  filter(!str_detect(c2, c1), !str_detect(c1, c2))

Output:

    c1     c2
1: dog tomato
2: cat   loop

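str_detect is vectorized over both the string and the pattern, so each row's c1 is compared with that row's c2. Because the check runs in both directions, a row is removed whether c1 is contained in c2 or c2 is contained in c1 (not just in the direction shown in your specific example).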
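
Note that str_detect interprets the pattern as a regular expression by default. If the column values may contain characters such as "." or "(", wrapping the pattern in stringr's fixed() compares them as literal text instead. A small sketch, assuming dplyr and stringr are installed; the fifth row ("a.c" / "abc") is a made-up addition to show the difference and is not part of the original data.

```r
library(dplyr)
library(stringr)

# Question's data plus one hypothetical row containing a regex metacharacter
df <- tibble(
  c1 = c("fish", "dog", "cat", "horse", "a.c"),
  c2 = c("fishing", "tomato", "loop", "horse", "abc")
)

# fixed() makes each pattern match literally, so "a.c" is not treated
# as "a, any character, c" and the last row is not dropped by accident
result <- df %>%
  filter(!str_detect(c2, fixed(c1)), !str_detect(c1, fixed(c2)))
```

Here result keeps the dog, cat and a.c rows; without fixed(), the a.c row would be filtered out because the regex "a.c" matches "abc".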

Upvotes: 2
