scikit-learn, add features to a vectorized set of documents

Question

I am starting with scikit-learn and I am trying to transform a set of documents into a format to which I can apply clustering and classification. I have read about the vectorization methods and the TF-IDF transformations used to load the files and index their vocabularies.

However, I have extra metadata for each document, such as the authors, the division that was responsible, a list of topics, etc.

How can I add features to each document vector generated by the vectorizing function?

ogrisel · Accepted Answer

You could use DictVectorizer to encode the extra categorical data and then use scipy.sparse.hstack to combine the resulting matrix column-wise with the text feature matrix.
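
A minimal sketch of that approach, in case it helps. The documents, the metadata keys (author, division, topic_*) and the variable names are made up for illustration, and TfidfVectorizer is assumed as the text vectorizer since the question mentions TF-IDF. Topics are encoded here as one 0/1 key per topic, which is only one possible way to represent a list-valued field.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents and per-document metadata (same order, same length).
docs = [
    "quarterly revenue report for the sales team",
    "neural network training notes",
    "sales forecast and revenue projections",
]
metadata = [
    {"author": "alice", "division": "finance", "topic_sales": 1},
    {"author": "bob", "division": "research", "topic_ml": 1},
    {"author": "alice", "division": "finance", "topic_sales": 1},
]

# Text features: one row per document, one column per vocabulary term.
text_vectorizer = TfidfVectorizer()
X_text = text_vectorizer.fit_transform(docs)

# Metadata features: string values are one-hot encoded as "key=value"
# columns, numeric values are kept as-is under the key name.
meta_vectorizer = DictVectorizer()
X_meta = meta_vectorizer.fit_transform(metadata)

# Stack the two sparse matrices side by side; rows stay aligned by document.
X = hstack([X_text, X_meta]).tocsr()

print(X_text.shape, X_meta.shape, X.shape)
```

The combined matrix X can then be passed to a clusterer or classifier like any other sparse feature matrix. For new documents, call transform (not fit_transform) on both vectorizers and stack the results in the same order, so the columns line up with the training data.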
