Reputation: 439
I looked at multiple tutorials on how to derive n-grams (here I will stick to bigrams) and include them as features in an NLP analysis.
My question is whether we need to include all possible bigram combinations as features, since not all bigrams are meaningful.
For example, if we have a sentence such as "I like this movie because it was fun and scary" and consider bigrams as well, these include (after pre-processing):
bigrams=["like movie","movie fun", "fun scary"]
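The bigrams above come from the stop-word-filtered token sequence, so each pair of adjacent surviving tokens forms one bigram. A minimal sketch of that step (assuming stop words have already been removed, as in the example):

```python
# Tokens left after pre-processing the example sentence
tokens = "like movie fun scary".split()

# Each adjacent pair of tokens is one bigram
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)  # ['like movie', 'movie fun', 'fun scary']
```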
Upvotes: 0
Views: 203
Reputation: 136
We may consider each bigram a feature of different importance. The question can then be reformulated as "How do we choose the most important features?". As you have already mentioned, one way is to keep only the top features ordered by term frequency across the corpus. Other possible ways to choose the most important features are:
These are the most widespread feature-selection methods in NLP. It is still possible to use other methods such as recursive feature elimination or sequential feature selection, but these become infeasible when the total number of features is high (say 10000) and only a small fraction of them (say 1000) is informative.
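The frequency-based option mentioned above can be sketched in a few lines: count every bigram across the corpus and keep only the top-k. This is a toy illustration (the corpus and the cutoff `k=2` are made up for the example); in practice a vectorizer's built-in frequency cap does the same job.

```python
from collections import Counter

# Toy corpus of already pre-processed documents (hypothetical example data)
corpus = [
    "like movie fun scary",
    "movie fun good plot",
    "fun scary movie",
]

def bigrams(text):
    """Return adjacent-token bigrams of a whitespace-tokenized string."""
    toks = text.split()
    return [" ".join(pair) for pair in zip(toks, toks[1:])]

# Term frequency of each bigram across the whole corpus
counts = Counter(b for doc in corpus for b in bigrams(doc))

# Keep only the k most frequent bigrams as features
k = 2
selected = [bg for bg, _ in counts.most_common(k)]
print(selected)
```

Rarely occurring bigrams are dropped, which is usually desirable: a bigram seen once in the corpus carries little signal and only inflates the feature matrix.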
Upvotes: 1