deanpwr

Reputation: 191

Remove substring rows from tibble

I have a tibble:

df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))

I want to remove rows that are substrings of other rows, resulting in:

result <- tibble(x = c('abcd', 'abd', 'efg'))

The solution must be quite efficient as there are ~1M rows of text.

Upvotes: 3

Views: 93

Answers (2)

det

Reputation: 5232

On small datasets this approach is slower (where speed is not a problem anyway), but on larger ones it is faster. The speed depends on how many unique groups there are relative to the size of the data.

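library(dplyr)
library(stringr)

# Longest strings first, so each string can only contain strings that come after it
df <- arrange(df, desc(nchar(x)))
my_strings <- df$x
i <- 1
while (i < length(my_strings)) {
  # Find the later strings contained in my_strings[[i]];
  # fixed() matches them literally instead of as regular expressions
  rest <- my_strings[(i + 1):length(my_strings)]
  indices <- which(str_detect(my_strings[[i]], fixed(rest))) + i
  if (length(indices) > 0) my_strings <- my_strings[-indices]
  i <- i + 1
}
result <- tibble(x = my_strings)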

A possible improvement, which I have not tested:

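library(data.table)
library(stringr)

setDT(df)
# One row per unique string, with the original row numbers kept in a list column;
# longest strings first
indices_df <- df[, .(indices = list(.I)), by = x][order(-nchar(x))]
my_strings <- indices_df$x
i <- 1
while (i < length(my_strings)) {
  # fixed() matches the candidate substrings literally, not as regular expressions
  rest <- my_strings[(i + 1):length(my_strings)]
  indices <- which(str_detect(my_strings[[i]], fixed(rest))) + i
  if (length(indices) > 0) my_strings <- my_strings[-indices]
  i <- i + 1
}

# Keep every original row whose string survived the loop
df[indices_df[x %in% my_strings, unlist(indices)]]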
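To try the idea on the example from the question, the loop can be wrapped in a small function. This is only a sketch: the name `maximal_strings` is made up for illustration, it assumes the input has no `NA` or empty strings, and it uses `fixed()` so that the candidate substrings are compared literally rather than as regular expressions.

```r
library(stringr)

# Hypothetical helper: returns the strings that are not contained in any other string.
maximal_strings <- function(strings) {
  # Deduplicate first, then put the longest strings at the front
  candidates <- unique(strings)
  candidates <- candidates[order(-nchar(candidates))]
  i <- 1
  while (i < length(candidates)) {
    rest <- candidates[(i + 1):length(candidates)]
    # Positions (in `candidates`) of later strings contained in candidates[[i]]
    contained <- which(str_detect(candidates[[i]], fixed(rest))) + i
    if (length(contained) > 0) candidates <- candidates[-contained]
    i <- i + 1
  }
  candidates
}

x <- c('a', 'ab', 'abc', 'abcd', 'abd', 'efg')
kept <- maximal_strings(x)

# Filter the original vector so the original row order is preserved
x[x %in% kept]
```

Filtering with `%in%` at the end keeps every copy of a surviving string, so exact duplicates are not treated as substrings of each other here.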

Upvotes: 0

danlooo

Reputation: 10637

str_extract(df$x, "foo") == "foo" tests, for each element of df$x, whether "foo" is a substring of it. Summing these matches for a given x always gives at least 1, because x is always a substring of itself. If the count is higher, x is also a substring of another element, so we remove it using filter(!).

library(tidyverse)

df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))

df %>% filter(! (x %>% map_lgl(~ sum(str_extract(df$x, .x) == .x, na.rm = TRUE) > 1)))
#> # A tibble: 3 x 1
#>   x    
#>   <chr>
#> 1 abcd 
#> 2 abd  
#> 3 efg

Created on 2022-02-18 by the reprex package (v2.0.0)
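One thing to watch with real text: str_extract() treats each string as a regular expression, so a row containing a character such as an unbalanced parenthesis would likely raise an error. The sketch below is a possible variant, not part of the original answer, that counts literal matches with str_detect() and fixed() instead. The function name `is_substring_of_other` is made up for illustration. Like the version above, it compares every string with every row, so the cost grows quadratically with the number of rows.

```r
library(tibble)
library(purrr)
library(stringr)

df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))

# Hypothetical helper: TRUE when `s` occurs in more than one element of
# `all_strings` (every string matches itself once).
is_substring_of_other <- function(s, all_strings) {
  sum(str_detect(all_strings, fixed(s))) > 1
}

# One logical per row: keep the rows that are not contained in another row
keep <- !map_lgl(df$x, function(s) is_substring_of_other(s, df$x))
result <- df[keep, ]
result
```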

Upvotes: 3
