Remove substring rows from tibble

Question

I have a tibble:

df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))

I want to remove rows that are substrings of other rows, resulting in:

result <- tibble(x = c('abcd', 'abd', 'efg'))

The solution must be quite efficient as there are ~1M rows of text.

danlooo · Accepted Answer

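For a given string such as "foo", str_extract(df$x, "foo") == "foo" tests, element by element, whether "foo" occurs in df$x; elements with no match give NA, hence na.rm = TRUE. Summing the result counts how many elements contain the string. That count is always at least 1, because every string is a substring of itself. If it is higher than 1, the string is also a substring of another element, so we remove those rows using filter(!). Note that str_extract treats its pattern as a regular expression, so this assumes the strings contain no regex metacharacters.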

library(tidyverse)

df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))

df %>% filter(! (x %>% map_lgl(~ sum(str_extract(df$x, .x) == .x, na.rm = TRUE) > 1)))
#> # A tibble: 3 x 1
#>   x    
#>   <chr>
#> 1 abcd 
#> 2 abd  
#> 3 efg

Created on 2022-02-18 by the reprex package (v2.0.0)
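If the text may contain regex metacharacters such as "(" or ".", a literal-matching variant of the same counting idea could look like the sketch below. It is not part of the answer above: drop_substrings is a hypothetical helper name, it uses base R's grepl(..., fixed = TRUE) instead of str_extract, and it assumes x has no NA values. It also counts over unique strings, so exact duplicates do not cause each other to be dropped. It still compares every string with every other one, so the quadratic cost is unchanged and it may remain slow at ~1M rows.

```r
# Hypothetical helper (not from the original answer): keep only the strings
# that are not a literal substring of any other distinct string in x.
# Assumes x is a character vector without NA values.
drop_substrings <- function(x) {
  ux <- unique(x)
  # For each unique string, count how many unique strings contain it.
  # fixed = TRUE matches the text literally instead of as a regex.
  n_containing <- vapply(
    ux,
    function(s) sum(grepl(s, ux, fixed = TRUE)),
    integer(1),
    USE.NAMES = FALSE
  )
  # A count of 1 means the string is contained only in itself.
  keep <- ux[n_containing == 1L]
  x[x %in% keep]
}

x <- c("a", "ab", "abc", "abcd", "abd", "efg")
drop_substrings(x)
```

With the tibble from the question, the same helper can be used as df %>% filter(x %in% drop_substrings(x)).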
