Reputation: 5380
I have a set of features picked - up by hand. Not all of them are single words; some of them are bigrams and some other are trigrams. I want to model my texts - that are provided in the form of raw texts explicitly based on these features. How can I do that in sklearn? This is how I have defined my Vectorizer so far.
def initialize():
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 3))
return vectorizer
Upvotes: 0
Views: 780
Reputation: 17629
CountVectorizer
and TfIdfVectorizer
allow you to specify the vocabulary to be used. Pass them as the keyword argument vocabulary
to the constructor. Quote from the docs:
vocabulary: Mapping or iterable, optional
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
Upvotes: 3