When are uni-grams more suitable than bi-grams (or higher N-grams)?

Question

I am reading about n-grams and I am wondering whether there is a case in practice when uni-grams would are preferred to be used over bi-grams (or higher N-grams). As I understand, the bigger N, the bigger complexity to calculate the probabilities and establish the vector space. But apart from that, are there other reasons (e.g. related to type of data)?

errantlinguist · Accepted Answer

This boils down to data sparsity: As your n-gram length increases, the amount of times you will see any given n-gram will decrease: In the most extreme example, if you have a corpus where the maximum document length is n tokens and you are looking for an m-gram where m=n+1, you will, of course, have no data points at all because it's simply not possible to have a sequence of that length in your data set. The more sparse your data set, the worse you can model it. For this reason, despite that a higher-order n-gram model, in theory, contains more information about a word's context, it cannot easily generalize to other data sets (known as overfitting) because the number of events (i.e. n-grams) it has seen during training becomes progressively less as n increases. On the other hand, a lower-order model lacks contextual information and so may underfit your data.

For this reason, if you have a very relatively large amount of token types (i.e. the vocabulary of your text is very rich) but each of these types has a very low frequency, you may get better results with a lower-order n-gram model. Similarly, if your training data set is very small, you may do better with a lower-order n-gram model. However, assuming that you have enough data to avoid over-fitting, you then get better separability of your data with a higher-order model.

When are uni-grams more suitable than bi-grams (or higher N-grams)?

Answers (2)

Related Questions