rrefining

Reputation: 13

Remove underscores between words so they don't appear in n-grams in R

Before running a topic model, I generated n-grams so that two- and three-word phrases could appear in the model afterward.

toks_data_ngrams <- tokens_ngrams(toks_data, n=2:3)

After this, however, my topic model includes many terms joined by underscores, such as a_b, apple_banana, and happy_hand.

How can I ignore those terms with underscores? I don't want them to be included in my topic model. Is there an extra argument for tokens_ngrams so that the n-grams don't contain an underscore between words? (I've already removed punctuation and symbols during pre-processing.)

Thanks in advance for your input!

Upvotes: 1

Views: 74

Answers (2)

phiver

Reputation: 23608

tokens_ngrams has a concatenator argument, which controls the character placed between the words of each n-gram. By default it is set to _. You can specify anything you want, a space for example:

tokens_ngrams(toks_data, n = 2:3, concatenator = " ")
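
To see the effect of the concatenator, here is a minimal sketch on a toy tokens object. The example text and the document name doc1 are made up for illustration and stand in for the asker's toks_data.

```r
library(quanteda)

# Hypothetical toy data standing in for the asker's toks_data
toks_data <- tokens(c(doc1 = "apple banana cherry"))

# Default concatenator "_": the words of each n-gram are joined by an underscore
ngrams_default <- as.character(tokens_ngrams(toks_data, n = 2:3))
print(ngrams_default)

# Space concatenator: the same n-grams, but without underscores
ngrams_space <- as.character(tokens_ngrams(toks_data, n = 2:3, concatenator = " "))
print(ngrams_space)
```

The n-grams are still single tokens in both cases; only the separator inside each token changes.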

Upvotes: 2

gaut

Reputation: 5958

You can exclude them using:

toks_data_ngrams <- toks_data_ngrams[!grepl("_", toks_data_ngrams)]
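
If toks_data_ngrams is a quanteda tokens object, a possible alternative is quanteda's own tokens_remove(), which filters tokens within each document. The sketch below uses made-up toy data and assumes unigrams were kept (n = 1:3). With the default concatenator every n-gram contains an underscore, so this removes all of them and leaves only the single words.

```r
library(quanteda)

# Hypothetical toy data; n = 1:3 keeps unigrams alongside the n-grams
toks_data <- tokens(c(doc1 = "apple banana cherry"))
toks_data_ngrams <- tokens_ngrams(toks_data, n = 1:3)

# Drop every token that contains an underscore, document by document
toks_no_underscore <- tokens_remove(
  toks_data_ngrams,
  pattern = "_",
  valuetype = "regex"
)

print(as.character(toks_no_underscore))
```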

In the future, always include a reproducible example in your questions.

Upvotes: 0
