Reputation: 282
I have words in a dfm like this:
library("quanteda")
dfmat <- dfm(c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other"))
Here, for example, the tokens "hello_text" and "text_hello" contain the same words in a different order. How is it possible to keep only one of these variants?
Expected output:
dfmat <- dfm(c("hello_text","test1_test2", "test2_test2_test2", "test2_other", "other"))
I found this solution example, but it only removes words that are exactly identical.
Upvotes: 0
Views: 113
Reputation: 6483
Split the strings at the underscore and sort the parts alphabetically, then use the resulting list to identify duplicates and apply that to the original vector:
words <- c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other")
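# Split each word at "_" and sort its parts, so word order no longer matters
words_sorted <- sapply(sapply(words, strsplit, "_"), sort)
# Keep only the first word for each set of sorted parts
words[!duplicated(words_sorted)]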
Returns:
[1] "hello_text" "test1_test2" "test2_test2_test2" "test2_other"
[5] "other"
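A slightly more defensive variant, as a sketch only: `sapply` may simplify its result to a matrix when every string happens to have the same number of parts, and `duplicated()` would then compare matrix rows rather than words. The version below builds one order-independent key per string instead. The helper name `canonical_key` is made up for illustration and is not part of quanteda or base R.

```r
words <- c("hello_text", "text_hello", "test1_test2", "test2_test1",
           "test2_test2_test2", "test2_other", "other")

# Hypothetical helper: split on "_", sort the parts, and glue them back
# together, so "hello_text" and "text_hello" map to the same key.
canonical_key <- function(x) {
  parts <- strsplit(x, "_", fixed = TRUE)
  vapply(parts, function(p) paste(sort(p), collapse = "_"), character(1))
}

keys <- canonical_key(words)

# Keep the first word seen for each key
unique_words <- words[!duplicated(keys)]
unique_words
```

If the words are features of an existing dfm, the same idea should carry over: compute the keys from `featnames(dfmat)` and keep the surviving features with something like `dfm_keep(dfmat, pattern = unique_words, valuetype = "fixed")`.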
Upvotes: 1