Mortimer

Reputation: 2975

scikit-learn, add features to a vectorized set of documents

I am getting started with scikit-learn and am trying to transform a set of documents into a format to which I can apply clustering and classification. I have read about the vectorization methods and the TF-IDF transformation used to load the files and index their vocabularies.

However, I have extra metadata for each document, such as the authors, the division that was responsible, a list of topics, etc.

How can I add features to each document vector generated by the vectorizing function?

Upvotes: 6

Views: 1771

Answers (1)

ogrisel

Reputation: 40159

You could use DictVectorizer to encode the extra categorical data as a sparse matrix, and then use scipy.sparse.hstack to concatenate it column-wise with the matrix produced by the text vectorizer.
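A minimal sketch of that approach might look like the following. The documents, the metadata keys (author, division) and the "topic=..." flag encoding for the list of topics are made up for illustration. The only real requirement is that the metadata list is in the same order as the documents, so the rows line up.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; in practice these would be your loaded files.
docs = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "the dog barks at the fox",
]

# Hypothetical metadata, one dict per document, in the same order as docs.
# String values are one-hot encoded by DictVectorizer; the "topic=..." keys
# are one assumed way to flatten a list of topics into 0/1 flags.
metadata = [
    {"author": "alice", "division": "research", "topic=animals": 1},
    {"author": "bob", "division": "sales", "topic=animals": 1, "topic=sports": 1},
    {"author": "alice", "division": "sales", "topic=sounds": 1},
]

# Text features: one row per document, one column per vocabulary term.
text_vectorizer = TfidfVectorizer()
X_text = text_vectorizer.fit_transform(docs)

# Metadata features: one row per document, one column per (key, value) pair.
meta_vectorizer = DictVectorizer()  # returns a sparse matrix by default
X_meta = meta_vectorizer.fit_transform(metadata)

# Stack the two blocks side by side; convert to CSR for efficient row slicing.
X = hstack([X_text, X_meta]).tocsr()

print(X_text.shape, X_meta.shape, X.shape)
```

The combined matrix X can then be passed to an estimator that accepts sparse input. Note that the TF-IDF rows are L2-normalized by default while the metadata columns are raw 0/1 values, so you may want to scale or weight one block relative to the other depending on the model.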

Upvotes: 10
