Reputation: 11
I am trying to classify text based on the bag-of-words model from NLP.
Here are my questions:
Upvotes: 0
Views: 2051
Reputation: 131
You may use scikit-learn's CountVectorizer to build feature vectors for the words in your documents, train a classifier of your choice on those vectors, and then use the classifier to evaluate your test data.
For the training set, you can use the vectorizer to build the feature matrix as follows:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Build the labelled training set directly (DataFrame.append does not modify in place)
LabeledWords = pd.DataFrame([{'word': 'Church', 'label': 'Religion'}])
vectorizer = CountVectorizer()
# Vectorize the words only; the labels stay as plain strings for the classifier
Xtrain = vectorizer.fit_transform(LabeledWords['word']).toarray()
yTrain = LabeledWords['label'].values
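If you want to sanity-check what the vectorizer learned, you can print its vocabulary and the resulting count matrix (a minimal sketch; get_feature_names_out is the method name in recent scikit-learn releases, older versions expose get_feature_names instead):
print(vectorizer.get_feature_names_out())  # e.g. ['church'] (tokens are lowercased by default)
print(Xtrain)  # one row of token counts per training word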
You can then train a classifier of your choice on those features, for example:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
clf = forest.fit(Xtrain, yTrain)
To evaluate the classifier on your test data:
test_featuresX, test_featuresY = [], []
for each_word, label in Preprocessed_list:
    # transform() expects an iterable of documents, hence the single-element list
    test_featuresX.append(vectorizer.transform([each_word]).toarray()[0])
    test_featuresY.append(label)
clf.score(test_featuresX, test_featuresY)
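Once the classifier is fitted you can also predict the label of a single unseen word. This is a minimal sketch assuming the vectorizer and clf objects from above; the word 'Temple' is just a hypothetical example:
new_word_vector = vectorizer.transform(['Temple']).toarray()  # hypothetical unseen word
predicted_label = clf.predict(new_word_vector)
print(predicted_label[0])  # e.g. 'Religion'
Keep in mind that a word that never appeared in the training vocabulary is transformed to an all-zero vector, so the classifier can only fall back on whatever it learned as the most likely class.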
Upvotes: 2