prashanth

Reputation: 4495

Sklearn CountVectorizer on custom vocabulary

I have a set of webpages and am trying to build a count matrix over them. I tried the standard CountVectorizer from sklearn, but it does not give the required results. The sample code is below:

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['www.google.com www.google.com', 'www.google.com www.facebook.com', 'www.google.com', 'www.facebook.com']
vocab = {'www.google.com':0, 'www.facebook.com':1}
vectorizer = CountVectorizer(vocabulary=vocab)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray()) 

It gives

['www.google.com', 'www.facebook.com']
[[0 0]
 [0 0]
 [0 0]
 [0 0]]

But the required result is

['www.google.com', 'www.facebook.com']
[[2 0]
 [1 1]
 [1 0]
 [0 1]]

How do we apply CountVectorizer to such a custom vocabulary?

Upvotes: 3

Views: 2820

Answers (1)

prashanth

Reputation: 4495

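As per the input from a related question, the issue occurred because of the tokenizer. The default token pattern only matches runs of word characters, so each URL is split at the dots into 'www', 'google' and 'com', none of which is in the vocabulary. A custom tokenizer that splits on whitespace was written, and now it works.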

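from sklearn.feature_extraction.text import CountVectorizer

def mytokenizer(text):
    # Split on whitespace only, so each URL stays a single token.
    return text.split()

corpus = ['www.google.com www.google.com', 'www.google.com www.facebook.com', 'www.google.com', 'www.facebook.com']
vocab = {'www.google.com':0, 'www.facebook.com':1}
# Passing a tokenizer overrides the default token_pattern.
vectorizer = CountVectorizer(vocabulary=vocab, tokenizer=mytokenizer)
X = vectorizer.fit_transform(corpus)
# On scikit-learn 1.0 and later, use get_feature_names_out(); get_feature_names() was removed in 1.2.
print(vectorizer.get_feature_names())
print(X.toarray())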
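
For anyone who wants to check the diagnosis, here is a minimal sketch that prints the tokens the default analyzer produces for one of the documents. It also tries an alternative that was not part of the original fix: setting `token_pattern=r'\S+'` so that every whitespace-delimited chunk is a token, instead of passing a tokenizer function. The names `default_tokens` and `counts` are just illustrative, and the sketch assumes the URLs in the corpus are already lowercase, since CountVectorizer lowercases input by default.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['www.google.com www.google.com', 'www.google.com www.facebook.com',
          'www.google.com', 'www.facebook.com']
vocab = {'www.google.com': 0, 'www.facebook.com': 1}

# The default analyzer keeps runs of 2+ word characters, so dots act as separators.
default_tokens = CountVectorizer().build_analyzer()('www.google.com www.facebook.com')
print(default_tokens)  # ['www', 'google', 'com', 'www', 'facebook', 'com']

# Alternative (assumption, not the original fix): treat any run of
# non-whitespace characters as one token.
vectorizer = CountVectorizer(vocabulary=vocab, token_pattern=r'\S+')
# With a fixed vocabulary nothing is learned, so transform() alone is enough.
counts = vectorizer.transform(corpus).toarray()
print(counts)
# [[2 0]
#  [1 1]
#  [1 0]
#  [0 1]]
```

Either approach should give the same matrix here. The `token_pattern` variant avoids the extra function, while a tokenizer function is more flexible if the splitting rule ever needs to be more than a single regex.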

Upvotes: 2
