Partha S Satpathy

Reputation: 11

How to standardize the bag of words for train and test?

I am trying to classify text using the bag-of-words model from NLP. This is what I have done so far:

  1. Pre-processed the training data using NLTK (punctuation and stop-word removal, lowercasing, stemming, etc.).
  2. Created a tf-idf matrix for the training data.
  3. Pre-processed the test data in the same way.
  4. Created a tf-idf matrix for the test data.
  5. The training and test data have different bags of words, so the number of features differs and a classification algorithm such as k-NN cannot be applied.
  6. I merged the training and test data and created a single tf-idf matrix. This solved the problem of differing bags of words, but the resulting matrix was too large to process.

Here are my questions:

  1. Is there a way to create exactly the same bag of words for train and test?
  2. If there is not, and my approach of merging train and test is correct, should I use a dimensionality reduction algorithm such as LDA?

Upvotes: 0

Views: 2051

Answers (1)

Shakar Bhattarai

Reputation: 131

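You can use scikit-learn's CountVectorizer to build word-count vectors from the training documents, train a classifier of your choice on them, and then use that classifier on your test data. The key is to fit the vectorizer on the training data only and reuse the same fitted vectorizer to transform the test data, so that both matrices have the same features.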
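As a minimal sketch of that idea (the example sentences and variable names below are made up for illustration, not taken from the question), fitting once and then calling `transform` on new text produces a matrix with the same columns as the training matrix. Words that never appeared in the training data are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, standing in for the pre-processed train/test text
train_docs = ["the church held a service", "the team won the match"]
test_docs = ["a service at the stadium"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary from train only
X_test = vectorizer.transform(test_docs)        # reuses that vocabulary, no refit

# Both matrices have one column per word in the training vocabulary
print(X_train.shape, X_test.shape)
print(sorted(vectorizer.vocabulary_))
```

Because the vocabulary comes from the training data alone, there should be no need to merge train and test just to align the features.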

For the training set, you can fit the vectorizer on the data as follows:

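 import pandas as pd
 from sklearn.feature_extraction.text import CountVectorizer

 # One row per labeled training example (DataFrame.append is not in-place and was removed in pandas 2.0)
 LabeledWords = pd.DataFrame([{'word': 'Church', 'label': 'Religion'}], columns=['word', 'label'])

 vectorizer = CountVectorizer()

 # Fit the vocabulary on the training text only; the labels are used as-is, not vectorized
 Xtrain, yTrain = vectorizer.fit_transform(LabeledWords['word']).toarray(), LabeledWords['label']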

You can then train the classifier of your choice on the vectorized training data, for example:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
clf = forest.fit(Xtrain, yTrain)

To evaluate on your test data, transform it with the same fitted vectorizer (do not refit it):

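# Preprocessed_list holds (text, label) pairs from the test set
test_words = [each_word for each_word, label in Preprocessed_list]
test_featuresY = [label for each_word, label in Preprocessed_list]
# transform() reuses the vocabulary learned from the training data
test_featuresX = vectorizer.transform(test_words).toarray()
clf.score(test_featuresX, test_featuresY)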
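Since the question uses tf-idf and k-NN rather than raw counts and a random forest, here is a rough sketch of the same fit-on-train, transform-on-test pattern with `TfidfVectorizer` and `KNeighborsClassifier`. The toy documents, the `max_features=5000` cap, `n_neighbors=1`, and the cosine metric are all illustrative assumptions, not values from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the pre-processed documents
train_docs = [
    "the church held a sunday service",
    "the priest read from the bible",
    "the team won the football match",
    "the striker scored a late goal",
]
train_labels = ["Religion", "Religion", "Sports", "Sports"]
test_docs = ["a service at the church", "the team scored a goal"]
test_labels = ["Religion", "Sports"]

# The pipeline fits the vectorizer on the training text only and
# applies the same fitted vocabulary and idf weights at predict time.
model = make_pipeline(
    TfidfVectorizer(max_features=5000),  # assumed cap on vocabulary size
    KNeighborsClassifier(n_neighbors=1, metric="cosine"),
)
model.fit(train_docs, train_labels)

predictions = model.predict(test_docs)
accuracy = model.score(test_docs, test_labels)
print(list(predictions), accuracy)
```

The pipeline passes the tf-idf matrix along in sparse form instead of calling `toarray()`, which may help when the dense matrix is too large to process. Capping the vocabulary with `max_features` (or raising `min_df`) is one simple way to shrink the feature space before reaching for a separate dimensionality reduction step.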

Upvotes: 2
