user706838

Reputation: 5380

How to tell a scikit-learn vectorizer to use specific features?

I have a set of hand-picked features. Not all of them are single words; some are bigrams and others are trigrams. I want to model my texts, which are provided as raw text, explicitly based on these features. How can I do that in sklearn? This is how I have defined my vectorizer so far:

from sklearn.feature_extraction.text import CountVectorizer

def initialize():
    # Count unigrams, bigrams, and trigrams found in the input texts.
    vectorizer = CountVectorizer(ngram_range=(1, 3))
    return vectorizer

Upvotes: 0

Views: 780

Answers (1)

Matt

Reputation: 17629

CountVectorizer and TfidfVectorizer both let you specify the vocabulary to use. Pass it as the keyword argument vocabulary to the constructor. Quote from the docs:

vocabulary: Mapping or iterable, optional

Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
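
Applied to the question, that could look roughly like the sketch below. The feature list and the two sample documents are made up for illustration; substitute your own hand-picked terms. It also assumes the default tokenizer, which lowercases the text and drops single-character tokens, so the terms in the vocabulary should be written the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical hand-picked features: a unigram, a bigram, and a trigram.
features = ["learning", "machine learning", "support vector machine"]

# ngram_range has to cover the longest feature; with the default (1, 1)
# the bigram and trigram columns would always be zero.
vectorizer = CountVectorizer(vocabulary=features, ngram_range=(1, 3))

# Made-up sample documents.
docs = [
    "machine learning with a support vector machine",
    "deep learning is a kind of machine learning",
]

# Columns follow the order of the features list.
X = vectorizer.fit_transform(docs)
counts = X.toarray().tolist()
print(counts)  # [[1, 1, 1], [2, 1, 0]]
```

Only the listed terms become columns; every other n-gram in the texts is ignored. Note that a unigram is still counted when it occurs inside a longer feature, which is why "learning" gets a count of 2 in the second document.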

Upvotes: 3
