Roman Purgstaller

Reputation: 962

Using language models for term weighting

I understand that scikit-learn supports n-grams via its vectorizers, but there the n-grams are just token strings to be counted. I would like to use a statistical language model (https://en.wikipedia.org/wiki/Language_model) instead, like this one: http://www.nltk.org/_modules/nltk/model/ngram.html.

So what I want is a vectorizer that uses the language-model probability as the term weight instead of, say, tf-idf or a plain token count. Is there a reason this is not supported by scikit-learn? I'm relatively inexperienced with language modeling, so I'm not sure whether this approach is even a good idea for text classification.
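To make it concrete, something like this sketch is what I have in mind. `UnigramProbVectorizer` is not an existing scikit-learn class; the name and the particular weighting choice are made up for illustration:

```python
# Hypothetical sketch, not a real scikit-learn class: a CountVectorizer
# subclass that rescales counts by corpus-level unigram probabilities.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

class UnigramProbVectorizer(CountVectorizer):
    def fit_transform(self, raw_documents, y=None):
        counts = super().fit_transform(raw_documents, y)
        totals = np.asarray(counts.sum(axis=0)).ravel()
        # MLE unigram estimate: P(word) = #word / #all_words in the corpus.
        self.unigram_probs_ = totals / totals.sum()
        return counts.multiply(self.unigram_probs_.reshape(1, -1)).tocsr()

    def transform(self, raw_documents):
        counts = super().transform(raw_documents)
        return counts.multiply(self.unigram_probs_.reshape(1, -1)).tocsr()
```

In other words, a drop-in replacement for CountVectorizer/TfidfVectorizer that plugs language-model probabilities into the term weights.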

Upvotes: 1

Views: 301

Answers (1)

lejlot

Reputation: 66835

It depends on what you mean by "term". If, as usual, a term is just a word, then a probability model will work the same as... simple tf weighting (even without the idf part!). Why? Because the empirical estimator of P(word) is just #word / #all_words, and since #all_words is a constant, the weight reduces to #word, which is plain term frequency. So in this sense, scikit already does what you need.
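A quick numeric check of that claim, on an invented toy corpus, using plain numpy and scikit-learn:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

counts = CountVectorizer().fit_transform(docs).toarray()

# Empirical unigram estimate for each entry: #word / #all_words,
# where #all_words is the same constant for every feature.
probs = counts / counts.sum()

print(counts)              # raw term frequencies
print(np.round(probs, 3))  # the same matrix, scaled by 1/#all_words
```

The two matrices differ only by the constant factor 1/#all_words, so any downstream model sees effectively the same features.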

Ok, so maybe you want to take context into account? Then what kind of context? Do you want to estimate P(pre-word1, word) independently and use it in a weighted sum for word? Then why not P(word, post-word1)? Why not P(pre-word2, pre-word1, word, post-word1, post-word2), and so on? Why not include some backoff to unigram estimates when a bigram is unavailable? The answer is quite simple: once you start using language models as weighting schemes, the number of possible variants grows exponentially, and there is no "typical" approach worth implementing as a "standard" in a library that is not an NLP library.
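Just to make one of those variants concrete, here is a sketch that weights a word by its probability given the preceding word. It uses nltk.lm, which in current NLTK releases replaces the linked nltk.model.ngram module; the tiny corpus and the helper name are invented:

```python
# One of the many possible context-weighting variants, for illustration only.
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

tokenized = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]

# Build unigram+bigram training data and fit a maximum-likelihood model.
train, vocab = padded_everygram_pipeline(2, tokenized)
lm = MLE(2)
lm.fit(train, vocab)

def left_context_weight(prev, word):
    # Weight = P(word | pre-word1); a joint P(pre-word1, word) or any
    # of the other variants above would be an equally arbitrary choice.
    return lm.score(word, [prev])

print(left_context_weight("the", "cat"))  # 0.25 on this toy corpus
```

Every other variant in the list above would need a different, equally ad-hoc definition, which is exactly why no single one has become a library standard.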

Upvotes: 1
