Remove underscores between words so they don't appear in n-grams in R

Question

Before running a topic model, I created n-grams so that two- and three-word chunks could appear in my topic model afterward.

toks_data_ngrams <- tokens_ngrams(toks_data, n = 2:3)

After this, however, my topic model includes so many words like a_b, apple_banana, happy_hand.

How can I get rid of those underscores? I don't want terms like these in my topic model. Is there an extra argument for the n-gram step so that it doesn't join words with an underscore in between? (I've already removed punctuation and symbols during pre-processing.)

Thanks so much in advance for all your input!

phiver · Accepted Answer

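tokens_ngrams() has a concatenator argument, which defaults to "_". That underscore is added by the n-gram step itself, which is why removing punctuation beforehand has no effect on it. You can set the concatenator to any string you want, a space for example:

tokens_ngrams(toks_data, n = 2:3, concatenator = " ")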
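
To see the effect end to end, here is a minimal sketch on a made-up two-document corpus. The example texts and the object names other than toks_data are assumptions for illustration only, and the code assumes the quanteda package is installed. It also shows n = 1:3, which may be worth considering if single words should stay in the topic model alongside the n-grams, since n = 2:3 produces only bigrams and trigrams.

```r
library(quanteda)

# Toy corpus, purely illustrative (not the asker's data)
txt <- c(d1 = "apple banana cherry", d2 = "happy hand wave")

# Tokenise, dropping punctuation and symbols as in the question's pre-processing
toks_data <- tokens(txt, remove_punct = TRUE, remove_symbols = TRUE)

# Default behaviour: words in each n-gram are joined with "_"
ngrams_default <- tokens_ngrams(toks_data, n = 2:3)

# Same n-grams, but joined with a space instead of an underscore
ngrams_space <- tokens_ngrams(toks_data, n = 2:3, concatenator = " ")

# Optional variant: n = 1:3 keeps the single words as well
ngrams_with_unigrams <- tokens_ngrams(toks_data, n = 1:3, concatenator = " ")

# Check whether any resulting term still contains an underscore
has_underscore_default <- any(grepl("_", types(ngrams_default), fixed = TRUE))
has_underscore_space <- any(grepl("_", types(ngrams_space), fixed = TRUE))

print(as.list(ngrams_space))
```

The resulting tokens object can then be passed to dfm() as usual before fitting the topic model. The terms will then read like "apple banana" rather than "apple_banana".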
