Reputation: 341
When I used bigrams, I appended the list of bigrams to the unigrams and used that as my corpus. With trigrams, I added trigrams to unigrams but left out bigrams.
Is this the correct approach, or would it be better to include bigrams as well if I want to incorporate trigrams? Should the process instead be: unigrams -> unigrams + bigrams -> unigrams + bigrams + trigrams?
Upvotes: 1
Views: 1519
Reputation: 341
After learning a bit more about features and tf-idf, I feel somewhat equipped to answer this question now.
The most basic version of TF-IDF builds its vocabulary from unigrams. One way to capture multi-word expressions is to add higher-order n-grams, such as bigrams and trigrams, to the vocabulary. Bigrams and trigrams capture two-word and three-word expressions respectively, so TF-IDF can weigh their prevalence across documents just as it does for single words.
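To make that concrete, here is a minimal sketch in plain Python of how a token list can be expanded with higher-order n-grams. The helper names `ngrams` and `build_features` are my own for illustration, not from any library, and whitespace tokenization is an assumption.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_features(tokens, orders=(1, 2, 3)):
    """Concatenate the n-grams of every requested order into one feature list."""
    features = []
    for n in orders:
        features.extend(ngrams(tokens, n))
    return features


# Toy sentence, split on whitespace (an assumption; real tokenizers differ).
tokens = "the quick brown fox".split()
features = build_features(tokens)
print(features)
```

The resulting list holds the unigrams first, then the bigrams, then the trigrams, and it can be fed to whatever builds the TF-IDF vocabulary.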
Where do you get the most bang for your buck when it comes to n-grams and multi-word expressions? It seems reasonable to start with bigrams, since two-word expressions are more common than three-word ones. Expressions like "brown fox" and "tall woman" will become distinct from "brown", "fox", "tall", and "woman". There is certainly a lot of value in trigrams and above (e.g., "quick brown fox"), but this value probably decreases as n grows, because longer n-grams are increasingly likely to capture noise rather than real expressions.
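A toy illustration of that intuition: the sketch below counts how many distinct n-grams occur more than once in a tiny made-up corpus. The three sentences and the name `repeated` are my own assumptions, and three sentences prove nothing about real data; the point is only to show how the check could be run on your own corpus.

```python
from collections import Counter

# Made-up corpus for illustration only.
docs = [
    "the quick brown fox jumps",
    "the quick brown dog sleeps",
    "a tall woman saw the brown fox",
]


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# For each order n, count the distinct n-grams seen at least twice in the corpus.
repeated = {}
for n in (1, 2, 3):
    counts = Counter(g for doc in docs for g in ngrams(doc.split(), n))
    repeated[n] = sum(1 for c in counts.values() if c >= 2)

print(repeated)
```

N-grams that appear only once carry little signal for comparing documents, so a falling count of repeated n-grams as n rises is one rough way to see the sparsity.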
My question, however, wasn't about whether trigrams are useful, but whether we should also use bigrams when we decide to use unigrams and trigrams. While there is no single right answer, I can't think of a case where skipping bigrams and going straight to trigrams would make sense, since doing so would ignore all the two-word expressions in your data. You wouldn't want to give up the strong explanatory power of bigrams just because you are adding higher-order n-grams.
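To see what gets dropped, here is a small sketch that compares the two setups from the question on one made-up sentence. The names `ngrams`, `features`, and `missing` are hypothetical helpers of mine, and whitespace tokenization is an assumption.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(tokens, orders):
    """Collect the n-grams for each order in `orders` into a single list."""
    return [gram for n in orders for gram in ngrams(tokens, n)]


# Made-up sentence for illustration.
tokens = "the tall woman saw the quick brown fox".split()

with_bigrams = set(features(tokens, (1, 2, 3)))   # unigrams + bigrams + trigrams
without_bigrams = set(features(tokens, (1, 3)))   # unigrams + trigrams only

# Features that disappear when the bigrams are skipped.
missing = sorted(with_bigrams - without_bigrams)
print(missing)
```

Every feature in `missing` is a bigram, including "brown fox" and "tall woman". As far as I know, scikit-learn's `TfidfVectorizer` takes `ngram_range=(min_n, max_n)` and uses every n in that range, so `ngram_range=(1, 3)` already follows the unigrams + bigrams + trigrams route.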
Upvotes: 3