Reputation: 325
I did a example about web classifiation with Naive Bayes library (Python) and Thats working perfectly (classfy web pages very well).
Actually I have 2 questions. Firstly,
I'm using only content of webpage (article side). Thats no problem but, I want that integrate title with double weighted effect to output. I can retrieve title of the page which variable list name is titles[]. Thats my codes for classfy :
x_train = vectorizer.fit_transform(temizdata)
classifer.fit(x_train, y_train)
I can add title to article text, but this time article text and title have same weight.
in codes, temizdata
is my list which keep article text of web pages. and y_train
is classes. How can I do integrate titles[] to classification with double weighted ?
I used Countvectorizer for vectorize , and Naive Bayes MultinominalNB classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
vectorizer = CountVectorizer()
classifer = MultinomialNB(alpha=.01)
Upvotes: 2
Views: 166
Reputation: 23637
If I understood your problem correctly, you want to use word count features derived from text and titles. Features from titles should receive twice the weight than text features. I don't think it makes sense to assign prior weights to features in this case (if it is even possible). After all, you do machine learning because you want the computer to figure out which features are more important.
I suggest two alternative approaches you could try:
Feature merging generate a x_text_train
from the text bodies and a x_title_train
from the titles and merge them like this:
x_text_train= text_vectorizer.fit_transform(temizdata)
x_title_train= title_vectorizer.fit_transform(titledata)
x_train = np.hstack(x_text_train, x_title_train)
Make sure to use two separate vectorizers for text and titles, so the classifier gets to know about the difference between text features and title features. If any of the features is more important, the classifier should work that out.
Do hierarchical classification: Train one classifier on the texts, like you already do. Train another classifier on the titles. Finally, train a third classifier on the output of the previous two classifiers.
Edit:
Upvotes: 1