sobosama

Reputation: 150

Unigram Gives Better Results than Ngram for Language Identification

I have a school project which consists of identifying each language of a tweet from a dataset of tweets. The dataset contains tweets in Spanish, Portuguese, English, Basque, Galician and Catalan. The task is to implement a language identification model using unigrams, bigrams and trigrams and to analyze the efficiency of each model.

I understand the concept of n-grams, and I understand that the languages are somewhat similar (hence it's not a trivial task), but what I don't understand is why I'm getting better results with unigrams than with bigrams, and better results with bigrams than with trigrams.

I can't see how that is possible, since I expected bigrams and trigrams to perform better.

Could you help shed some light on why this is happening?

Thank you for your time.

Upvotes: 2

Views: 632

Answers (1)

Arya McCarthy

Reputation: 8829

Short answer: higher-order n-grams have a data sparsity problem. (We tend to address this with smoothing.) That can make them less informative, because so many of them are unseen, which makes the true data distribution harder to learn without more data.
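To see the sparsity concretely, here is a minimal sketch (toy Spanish sentences, not your dataset): it counts character n-grams from a tiny "training" string and checks how many n-grams of a held-out string were never seen. The fraction of unseen n-grams grows quickly with n.

```python
from collections import Counter

def char_ngrams(text, n):
    # All overlapping character n-grams of length n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Toy data purely for illustration.
train = "el gato duerme en la casa y el perro corre por el parque"
held_out = "la gata blanca salta sobre la mesa del comedor"

for n in (1, 2, 3):
    seen = Counter(char_ngrams(train, n))
    test = char_ngrams(held_out, n)
    unseen = sum(1 for g in test if g not in seen)
    print(f"n={n}: {unseen}/{len(test)} held-out n-grams never seen in training")
```

With more data the unseen fraction shrinks, which is why higher-order models usually need much larger corpora before they pay off.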

You note that smaller smoothing amounts give better performance than larger ones. This is because the lower values let you listen to your data more. The smoothing acts like a 'prior belief', while the counts reflect the actual data. If you smooth too much, you're (almost) completely ignoring your data, and every unigram becomes (nearly) equally likely.
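Here is a minimal sketch of add-k (Laplace) smoothing over a hypothetical four-word vocabulary, just to show the effect: as k grows, the smoothed probabilities flatten toward uniform, i.e. the model stops listening to the observed counts.

```python
# Toy counts, purely illustrative.
counts = {"a": 50, "b": 10, "c": 1, "d": 0}
V = len(counts)            # vocabulary size
N = sum(counts.values())   # total observed tokens

for k in (0.01, 1, 100):
    # Add-k smoothing: P(w) = (count(w) + k) / (N + k * V)
    probs = {w: (c + k) / (N + k * V) for w, c in counts.items()}
    print(f"k={k}: " + ", ".join(f"P({w})={p:.3f}" for w, p in probs.items()))
```

With k=0.01 the estimates track the counts closely; with k=100 all four probabilities are close to 0.25 regardless of what was observed.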

Upvotes: 2
