Combining feature sets using a MultinomialNB

Question

This is a very basic question about feature sets.

Let's say I have a group of people, each with various features, to whom I want to make recommendations. Each of them has also written a paragraph of free-form text that is highly significant for what I need to recommend to them.

I understand how to vectorize their sample text, but I don't know how to then add features such as nationality, age, sex, etc.

So I have this:

# dbsession = SQLAlchemy session
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

people = dbsession.query(People).filter(People.category != "inactive")
all_text = [(a.all_text, a.category) for a in people]
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform([x[0] for x in all_text])
y_train = [x[1] for x in all_text]
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

print("Training score: {0:.1f}%".format(classifier.score(X_train, y_train) * 100))
a = People.populate_from_db(dbsession, 2309601)  # this gives me the person I want to categorise
print(a)

sample_text = a.all_text
t_form = vectorizer.transform([sample_text])
probs = classifier.predict_proba(t_form)
# Print the predicted probability for each class
for i, p in enumerate(probs[0]):
    print("# ", classifier.classes_[i], "%.2f %%" % (p * 100))

(Yes, I know that I shouldn't be testing with an item from the training set, but I am just getting the code running first before putting real data into it.)

Now, if a People object had an attribute such as "nationality", what is the best way to add that to the classifier?

Raff.Edward · Accepted Answer

1) The issue of adding additional fields to your vector.

A) Just create a new X_train_extended that has the same number of dimensions as X_train, plus one for each thing you want to add. Copy the values over and insert what you want in the extra space.

B) Try to use the FeatureUnion from scikit-learn to do that for you.

2) Will your addition make sense? In this case, no. Storing a numeric value of 'age' doesn't make sense for the MultinomialNB model, which treats each feature as a count (or count-like weight) of occurrences. It might work anyway, but you should be aware that what you are doing now violates the assumptions of the model you are trying to use.
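
For option 1A, here is a minimal sketch of what the extended matrix could look like. It assumes nationality is one-hot encoded (one 0/1 column per nationality, which stays count-like) and glued onto the tf-idf matrix with scipy.sparse.hstack. The toy texts, nationalities and labels are made up for illustration and stand in for your People rows.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the People rows in the question.
texts = [
    "I love hiking and mountain trails",
    "Looking for quiet museums and galleries",
    "Mountain biking and climbing every weekend",
    "Art history and classical concerts",
]
nationalities = [["FR"], ["DE"], ["FR"], ["IT"]]  # one column, one row per person
labels = ["outdoor", "culture", "outdoor", "culture"]

vectorizer = TfidfVectorizer(sublinear_tf=True)
encoder = OneHotEncoder(handle_unknown="ignore")  # unseen nationalities become all zeros

X_text = vectorizer.fit_transform(texts)
X_nat = encoder.fit_transform(nationalities)

# Append the one-hot columns to the right of the text columns.
X_train_extended = hstack([X_text, X_nat]).tocsr()

classifier = MultinomialNB()
classifier.fit(X_train_extended, labels)

# A new person must go through the same two transformers, in the same order.
new_row = hstack([
    vectorizer.transform(["weekend climbing trip"]),
    encoder.transform([["FR"]]),
]).tocsr()
probs = classifier.predict_proba(new_row)
```

The important part is that the fitted vectorizer and encoder are reused at prediction time, so the columns line up with the ones the classifier was trained on.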

No one can tell you what model you should use since we don't have your data, but you should understand what your model is and what assumptions it makes. Then you can decide what the best form of these additional features would be to put into your model.
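
As one possible "form" for a feature like age, you could bin it into a few ranges and one-hot encode the bins, so that every column handed to MultinomialNB is still a non-negative, count-like indicator. The sketch below is only an illustration under that assumption. Note that it uses ColumnTransformer rather than FeatureUnion, since ColumnTransformer routes each column to its own transformer; the DataFrame, its column names, the bin count and the labels are all made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Hypothetical toy rows standing in for the People table.
people_df = pd.DataFrame({
    "all_text": [
        "I love hiking and mountain trails",
        "Looking for quiet museums and galleries",
        "Mountain biking and climbing every weekend",
        "Art history and classical concerts",
    ],
    "nationality": ["FR", "DE", "FR", "IT"],
    "age": [22, 61, 25, 58],
})
labels = ["outdoor", "culture", "outdoor", "culture"]

features = ColumnTransformer([
    # A plain string selects a 1-D column, which is what TfidfVectorizer expects.
    ("text", TfidfVectorizer(sublinear_tf=True), "all_text"),
    # A list selects a 2-D block, which the encoders below expect.
    ("nationality", OneHotEncoder(handle_unknown="ignore"), ["nationality"]),
    # Age becomes three equal-width bins, each a 0/1 indicator column.
    ("age", KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform"), ["age"]),
])

model = make_pipeline(features, MultinomialNB())
model.fit(people_df, labels)

new_person = pd.DataFrame({
    "all_text": ["weekend climbing trip"],
    "nationality": ["FR"],
    "age": [24],
})
probs = model.predict_proba(new_person)
```

Whether three equal-width bins are sensible depends entirely on your data; the point is only that the binned columns no longer hand the model a raw magnitude it would misread as a count.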
