How to standardize the bag of words for train and test?

Question

I am trying to classify text using the bag-of-words model from NLP. This is what I did:

Pre-processed the train data using NLTK (punctuation and stop-word removal, lowercasing, stemming, etc.).
Created a tf-idf matrix for the train data.
Pre-processed the test data in the same way.
Created a tf-idf matrix for the test data.
The train and test data have different bags of words, so the number of features differs, and I cannot use a classification algorithm like k-NN.
I merged the train and test data and created a single tf-idf matrix. This solved the problem of the differing bags of words, but the resulting matrix was too large to process.

Here are my questions:

Is there a way to create exactly the same bag of words for train and test?
If there is not, and my approach of merging train and test is correct, should I use a dimensionality reduction algorithm like LDA?
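
To make the first question concrete: one common way to get the same features for both sets is to learn the vocabulary (and the idf weights) from the train data only, then map the test data onto that fixed vocabulary, ignoring any test-only terms. Below is a minimal sketch of that idea in plain Python, assuming the documents are already tokenized. The function names fit_vocabulary and transform are hypothetical, the toy documents are made up, and the smoothed idf formula is just one common variant.

```python
import math
from collections import Counter

def fit_vocabulary(train_docs):
    """Learn the vocabulary and idf weights from the training documents only."""
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Fixed column index for every training term (sorted for reproducibility).
    vocab = {term: idx for idx, term in enumerate(sorted(doc_freq))}
    n_docs = len(train_docs)
    # Smoothed idf; an illustrative choice, not the only possible formula.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    return vocab, idf

def transform(docs, vocab, idf):
    """Project documents onto the fixed vocabulary; unseen terms are dropped."""
    rows = []
    for tokens in docs:
        counts = Counter(t for t in tokens if t in vocab)
        row = [0.0] * len(vocab)
        for term, count in counts.items():
            row[vocab[term]] = count * idf[term]
        rows.append(row)
    return rows

# Toy, already pre-processed documents (hypothetical data).
train_docs = [["cat", "sat", "mat"], ["dog", "sat", "log"]]
test_docs = [["cat", "ran", "log"]]  # "ran" never appears in train

vocab, idf = fit_vocabulary(train_docs)
X_train = transform(train_docs, vocab, idf)
X_test = transform(test_docs, vocab, idf)

# Both matrices now have one column per training term.
print(len(X_train[0]), len(X_test[0]))
```

With this pattern the test matrix always has the same number of columns as the train matrix, and the test data never influences the vocabulary. scikit-learn's TfidfVectorizer follows the same split: fit (or fit_transform) on train, then transform on test.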

Answers (1)