Representing documents in vector space model

Question

I have a very fundamental question. I have two sets of documents, one for training and one for testing. I would like to train a logistic regression classifier on the training documents, and I want to check that my approach is correct.

First, find all unique words in the training documents and call this list the vocabulary.
Then, for each word in the vocabulary, compute its TF-IDF score in every training document. Each document is then represented as a vector of these TF-IDF scores.
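To make the two steps above concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization, lowercasing, raw term counts for TF, and the unsmoothed IDF log(N / df); these are illustrative choices and other TF-IDF variants are common. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Return the sorted list of unique tokens across all training documents."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def document_frequencies(docs, vocabulary):
    """Count, for each vocabulary term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return {term: df[term] for term in vocabulary}

def tfidf_vector(doc, vocabulary, df, n_docs):
    """Dense TF-IDF vector: raw term count times log(N / df), one entry per vocabulary term."""
    counts = Counter(doc.lower().split())
    return [counts[term] * math.log(n_docs / df[term]) for term in vocabulary]

# Toy training corpus (made up for illustration).
train_docs = ["the cat sat", "the dog barked", "the cat purred"]
vocab = build_vocabulary(train_docs)
df = document_frequencies(train_docs, vocab)
train_vectors = [tfidf_vector(d, vocab, df, len(train_docs)) for d in train_docs]
```

Note that with this unsmoothed IDF, a word that occurs in every training document (here "the") gets a score of zero everywhere.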

My question is: how do I represent the test documents? Say one of the test documents does not contain any word from the vocabulary. In that case, the TF-IDF scores will be zero for every word in the vocabulary for that document.

I'm trying to use LIBSVM, which uses a sparse vector format. How do I represent the document above, whose vector has every entry set to 0, in that format?
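For reference, the LIBSVM sparse format puts one example per line as `<label> <index>:<value> ...`, with 1-based, increasing indices and zero-valued entries omitted. The sketch below writes such lines; `to_libsvm_line` is a hypothetical helper, not part of LIBSVM. Under this format an all-zero vector reduces to a line holding only the label. To my knowledge the LIBSVM tools read such a line as the zero vector, but that is worth verifying against your version.

```python
def to_libsvm_line(label, vector):
    """Format one example as '<label> <index>:<value> ...'.

    Indices are 1-based and zero entries are skipped, so an
    all-zero vector yields a line containing only the label.
    Values are written with up to 6 significant digits.
    """
    features = [f"{i}:{v:.6g}" for i, v in enumerate(vector, start=1) if v != 0]
    return " ".join([str(label)] + features)

line_regular = to_libsvm_line(1, [0.0, 0.5, 0.0, 1.25])
line_all_zero = to_libsvm_line(-1, [0.0, 0.0, 0.0, 0.0])
```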

Rob Neuhaus · Accepted Answer

You have to store enough information about the training corpus to apply the TF-IDF transform to unseen documents. This means you'll need the document frequencies of the terms in the training corpus, along with the number of training documents. Ignoring words in the test docs that never appeared in training is fine: your SVM won't have learned a weight for them anyway. Note that unseen terms should be rare in the test corpus if your training and test distributions are similar, so even if a few terms are dropped, you'll still have plenty of terms left to classify the doc.
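A rough sketch of that idea in plain Python: compute the IDF weights once from the training corpus, keep them, and reuse them for any new document, silently dropping terms that were never seen in training. The names `fit_idf` and `transform` are hypothetical, and the whitespace tokenization and unsmoothed log(N / df) weighting are assumptions made for illustration.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Compute IDF = log(N / df) for every term in the training corpus."""
    n_docs = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc.lower().split()))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def transform(doc, idf):
    """Sparse TF-IDF for one document, using only the stored training IDF.

    Terms missing from the training vocabulary are ignored, so a
    document made up entirely of unseen words maps to an empty dict.
    """
    counts = Counter(doc.lower().split())
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}

# Toy corpus (made up for illustration).
idf = fit_idf(["the cat sat", "the dog barked", "the cat purred"])
partly_unseen = transform("the dog sat quietly", idf)  # "quietly" is dropped
fully_unseen = transform("a zebra galloped", idf)      # empty: no known terms
```

The statistics come only from the training documents; the test documents never contribute to the document frequencies.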
