Reputation: 2975
I am starting with scikit-learn and I am trying to transform a set of documents into a format on which I could apply clustering and classification. I have read about the vectorization methods and the TF-IDF transformations used to load the files and index their vocabularies.
However, I have extra metadata for each document, such as the authors, the division that was responsible, a list of topics, etc.
How can I add features to each document vector generated by the vectorizing function?
Upvotes: 6
Views: 1771
Reputation: 40159
You could use the DictVectorizer for the extra categorical data and then use scipy.sparse.hstack to combine its output with the text feature matrix.
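A minimal sketch of that approach might look like the following. The toy documents, the metadata keys (`author`, `division`, `topic=...`), and the variable names are all made up for illustration; the only assumption about the real data is that each document's metadata can be expressed as a dict. String values are one-hot encoded by DictVectorizer, and the list of topics is represented here as one indicator key per topic.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; in practice these would be the loaded files.
docs = [
    "budget report for the third quarter",
    "quarterly sales forecast and budget",
    "server outage postmortem",
]

# Hypothetical metadata, one dict per document, in the same order as docs.
# Each topic is stored as its own indicator key with value 1.
metadata = [
    {"author": "alice", "division": "finance", "topic=budget": 1},
    {"author": "bob", "division": "finance", "topic=budget": 1, "topic=sales": 1},
    {"author": "carol", "division": "engineering", "topic=ops": 1},
]

# TF-IDF features from the document text (sparse matrix, one row per document).
text_vectorizer = TfidfVectorizer()
X_text = text_vectorizer.fit_transform(docs)

# One-hot / indicator features from the metadata (also sparse by default).
meta_vectorizer = DictVectorizer()
X_meta = meta_vectorizer.fit_transform(metadata)

# Stack the two matrices side by side; rows must line up document by document.
X = hstack([X_text, X_meta]).tocsr()
```

The combined matrix `X` can then be passed to a clustering or classification estimator. For new documents, call `transform` (not `fit_transform`) on both vectorizers before stacking so the columns stay aligned. Since TF-IDF rows are normalized while the metadata columns are 0/1 indicators, the relative weight of the two groups may need tuning.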
Upvotes: 10