Todd

Reputation: 439

Do we include all the combinations of n-grams in the actual analysis?

I have looked at multiple tutorials on how to derive n-grams (here I will stick to bigrams) and include them in an NLP analysis.
My question is whether we need to include all the possible bigrams as features, because not all of them would be meaningful.
For example, if we have a sentence such as "I like this movie because it was fun and scary" and consider bigrams as well, the bigrams include (after pre-processing):

bigrams = ["like movie", "movie fun", "fun scary"]
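
For reference, a minimal sketch of how such bigrams can be extracted with scikit-learn's CountVectorizer; the already pre-processed token string is assumed, and the stop-word removal itself is not shown:

    from sklearn.feature_extraction.text import CountVectorizer

    # The sentence after the pre-processing described above
    # (stop words already removed).
    docs = ["like movie fun scary"]

    # ngram_range=(2, 2) extracts bigrams only.
    vectorizer = CountVectorizer(ngram_range=(2, 2))
    vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())
    # ['fun scary' 'like movie' 'movie fun']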

I am not sure whether this is a good approach, but what I can think of for now is to include only some frequent bigrams as features (a sketch of this idea follows below).
Or are there other practical norms for efficiently including only meaningful bigrams (although "meaningful" might be subjective and context-dependent)?
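
Here is a minimal sketch of the frequency-based idea, using CountVectorizer's max_features parameter; the corpus is made up for illustration:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "I like this movie because it was fun and scary",
        "the movie was fun but not scary",
        "a scary movie that was not fun at all",
    ]

    # Keep only the 5 most frequent uni-/bigrams across the corpus;
    # everything rarer is dropped from the feature set.
    vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english",
                                 max_features=5)
    X = vectorizer.fit_transform(corpus)
    print(vectorizer.get_feature_names_out())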

Upvotes: 0

Views: 203

Answers (1)

David Alami

Reputation: 136

We may consider each bigram as a feature of varying importance. The question can then be reformulated as "How do we choose the most important features?". As you have already mentioned, one way is to keep only the most frequent terms across the corpus (the max_features approach sketched in the question). Other possible ways to choose the most important features are:

  • Apply the TF-IDF weighting scheme; it also gives you two additional hyperparameters to control, max document frequency and min document frequency (see the first sketch below);
  • Use Principal Component Analysis to compress a big feature set (strictly speaking, PCA builds new components out of the bigram counts rather than picking individual bigrams; see the second sketch below);
  • Train any estimator in scikit-learn and then select the features from the trained model, e.g. with SelectFromModel (see the third sketch below).
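
A minimal sketch of the TF-IDF option with the two document-frequency hyperparameters; the toy corpus is made up, and useful thresholds are task-dependent:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the movie was fun",
        "the movie was scary",
        "a fun and scary movie",
    ]

    # max_df=0.8 drops bigrams appearing in more than 80% of documents,
    # min_df=2 drops bigrams appearing in fewer than 2 documents.
    vectorizer = TfidfVectorizer(ngram_range=(2, 2), max_df=0.8, min_df=2)
    X = vectorizer.fit_transform(corpus)
    print(vectorizer.get_feature_names_out())
    # ['movie was' 'the movie']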
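
A sketch of the PCA route on the same kind of toy data. Note that scikit-learn's PCA needs a dense matrix, so on large sparse term matrices TruncatedSVD (latent semantic analysis) is the usual sparse-friendly substitute:

    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the movie was fun",
        "the movie was scary",
        "a fun and scary movie",
    ]
    X = CountVectorizer(ngram_range=(2, 2)).fit_transform(corpus)

    # Project the bigram counts onto 2 principal components;
    # .toarray() densifies, which is only viable for small matrices.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X.toarray())
    print(X_reduced.shape)  # (3, 2)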
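
And a sketch of model-based selection with SelectFromModel; the corpus and sentiment labels are invented, and the L1-penalised logistic regression is one of several reasonable estimator choices:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    corpus = [
        "the movie was fun",
        "the movie was scary",
        "a boring and dull film",
        "a dull slow film",
    ]
    y = [1, 1, 0, 0]  # hypothetical sentiment labels

    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(corpus)

    # An L1-penalised model zeroes out uninformative n-gram weights;
    # SelectFromModel keeps only features with non-zero coefficients.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
    clf.fit(X, y)
    selector = SelectFromModel(clf, prefit=True)
    print(vec.get_feature_names_out()[selector.get_support()])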

These are the most widespread feature-selection methods in the NLP field. It is still possible to use other methods like recursive feature elimination or sequential feature selection, but these methods are not feasible when the total number of features is high (like 10000) and only a small subset (like 1000) is informative, because they refit the model many times.
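
For completeness, a minimal RFE sketch on the same toy data; on a realistic vocabulary the repeated refitting is exactly what makes it expensive:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    corpus = [
        "the movie was fun",
        "the movie was scary",
        "a boring and dull film",
        "a dull slow film",
    ]
    y = [1, 1, 0, 0]  # hypothetical labels

    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(corpus)

    # Each elimination round refits the model and drops the weakest
    # features, which is why RFE scales poorly to tens of thousands
    # of n-gram features.
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
    rfe.fit(X, y)
    print(vec.get_feature_names_out()[rfe.support_])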

Upvotes: 1
