Remove words from a dataframe which are the same but in a different order

Question

I have words in a dfm like this:

library("quanteda")

Package version: 2.1.2

dfmat <- dfm(c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other"))

in which, for example, the tokens "hello_text" and "text_hello" contain the same words in a different order. How is it possible to keep only one of these variants?

Example output

dfmat <- dfm(c("hello_text", "test1_test2", "test2_test2_test2", "test2_other", "other"))

I found this solution example, but it only removes words that are exactly the same.

dario · Accepted Answer

Split the strings at the underscore and sort the parts alphabetically, then use the resulting list to identify duplicates and apply that to the original vector:

words <- c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other")

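# lapply() always returns a list; sapply() would simplify to a matrix
# if every word had the same number of parts, which breaks duplicated()
words_sorted <- lapply(strsplit(words, "_", fixed = TRUE), sort)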

words[!duplicated(words_sorted)]

Returns:

[1] "hello_text"        "test1_test2"       "test2_test2_test2" "test2_other"      
[5] "other"
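
The same idea can be wrapped in a small function that builds one canonical string per word instead of a list. This is only a sketch in base R: the name drop_reordered and its sep argument are made up for illustration and are not part of quanteda.

```r
# Hypothetical helper: keep the first of each group of words whose
# separator-delimited parts are the same apart from their order.
drop_reordered <- function(words, sep = "_") {
  parts <- strsplit(words, sep, fixed = TRUE)
  # Canonical key: the parts sorted and re-joined,
  # so "text_hello" and "hello_text" get the same key
  keys <- vapply(parts, function(p) paste(sort(p), collapse = sep), character(1))
  words[!duplicated(keys)]
}

words <- c("hello_text", "text_hello", "test1_test2", "test2_test1",
           "test2_test2_test2", "test2_other", "other")

kept <- drop_reordered(words)
print(kept)
```

For the dfm itself, the feature names should be available through featnames(dfmat), so something along the lines of dfmat[, featnames(dfmat) %in% drop_reordered(featnames(dfmat))] would presumably keep only the matching columns. I have not run that part against quanteda 2.1.2, so treat it as an assumption.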
