Akhil Nair

Reputation: 3284

Remove elements of a vector that are substrings of another

Is there a better way to achieve this? I'd like to remove every string in this vector that is a substring of another element.

words = c("please can you", 
  "please can", 
  "can you", 
  "how did you", 
  "did you",
  "have you")
> words
[1] "please can you" "please can"     "can you"        "how did you"    "did you"        "have you"

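library(data.table)
library(stringr)
# Pair every word with every word (including itself)
dt = setDT(expand.grid(word1 = words, word2 = words, stringsAsFactors = FALSE))
# fixed() matches word2 literally rather than as a regular expression
dt[, found := str_detect(word1, fixed(word2))]
# Drop any word that is found inside a different word
setdiff(words, dt[found == TRUE & word1 != word2, word2])
[1] "please can you" "how did you"    "have you" 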

This works, but it seems like overkill and I'm interested to know a more elegant way of doing it.

Upvotes: 5

Views: 124

Answers (1)

G. Grothendieck

Reputation: 270248

Search for each element of words within every element of words, keeping those that are found exactly once (that is, only in themselves):

words[colSums(sapply(words, grepl, words, fixed = TRUE)) == 1]

giving:

[1] "please can you" "how did you"    "have you"   
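To make the one-liner easier to follow, here is a step-by-step sketch of the same idea. The intermediate names m, hits and kept are introduced only for illustration and are not part of the original answer.

```r
words <- c("please can you", "please can", "can you",
           "how did you", "did you", "have you")

# One column per word: m[i, j] is TRUE when words[j] occurs
# literally (fixed = TRUE, no regex) inside words[i].
m <- sapply(words, grepl, words, fixed = TRUE)

# Column sums count how many elements contain each word. Every word
# contains itself, so a count of 1 means no other element contains it.
hits <- colSums(m)

kept <- words[hits == 1]
print(kept)
```

One caveat, stated as an assumption about the intended behaviour: if words contains exact duplicates, each copy is found inside the other, so every copy is dropped. Running the same code on unique(words) should avoid that if one copy is meant to survive.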

Upvotes: 6
