qaiser

Reputation: 2868

CountVectorizer not able to detect words

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

final_vocab = {'Amazon',
'Big Bazaar',
'Brand Factory',
'Central',
'Cleartrip',
'Dominos',
'Flipkart',
'IRCTC',
'Lenskart',
'Lifestyle',
'MAX',
'MMT',
'More',
'Myntra'}
 
vect = CountVectorizer(vocabulary=final_vocab)
token_df = pd.DataFrame(vect.fit_transform(['Big Bazaar','Brand Factory']).todense(), columns=vect.get_feature_names())


Why is all the output zero? The values for Big Bazaar and Brand Factory should be 1.

Upvotes: 0

Views: 233

Answers (1)

manaclan

Reputation: 994

Your CountVectorizer is missing 2 things:

  1. ngram_range=(2, 2). As stated in the docs: "All values of n such that min_n <= n <= max_n will be used." This makes CountVectorizer emit 2-grams from the input ('Big Bazaar' instead of ['Big', 'Bazaar']). Note that (2, 2) emits only 2-grams, so single-word entries such as 'Amazon' will never match; use ngram_range=(1, 2) if you need both.
  2. lowercase=False. The default, lowercase=True, is documented as "Convert all characters to lowercase before tokenizing." That turns Big Bazaar and Brand Factory into lowercase tokens, which can't be found in your mixed-case vocabulary. Setting it to False prevents that.
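
To check both points yourself, you can look at the tokens the analyzer produces before any vocabulary lookup happens. This is only a small sketch: the variable names default_tokens and bigram_tokens are mine, and it uses build_analyzer(), which returns the preprocessing and tokenizing callable without needing a fit.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: lowercase=True, ngram_range=(1, 1)
default_tokens = CountVectorizer().build_analyzer()("Big Bazaar")

# Settings suggested above: keep the original case, emit only 2-grams
bigram_tokens = CountVectorizer(
    ngram_range=(2, 2), lowercase=False
).build_analyzer()("Big Bazaar")

print(default_tokens)  # ['big', 'bazaar']
print(bigram_tokens)   # ['Big Bazaar']
```

Neither 'big' nor 'bazaar' is a key in your vocabulary, which is why every count comes out as zero with the defaults.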

Also, because you've provided a fixed vocabulary to CountVectorizer, there is nothing left to fit, so use transform instead of fit_transform:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

final_vocab = ['Amazon',
'Big Bazaar',
'Brand Factory',
'Central',
'Cleartrip',
'Dominos',
'Flipkart',
'IRCTC',
'Lenskart',
'Lifestyle',
'MAX',
'MMT',
'More',
'Myntra']
 
vect = CountVectorizer(vocabulary=final_vocab, ngram_range=(2, 2), lowercase=False)
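# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
token_df = pd.DataFrame(vect.transform(['Big Bazaar','Brand Factory']).todense(), columns=vect.get_feature_names_out())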
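
If your real data also contains the single-word brands, a variant with ngram_range=(1, 2) may be closer to what you want. The following is a minimal sketch under my own assumptions: the shortened brands list and the second input string 'Amazon Myntra' are made up for illustration and are not from your data. Because the vocabulary is passed as a list, the column order is the list order, so the list itself can be reused as the column labels.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, illustrative vocabulary (assumption: a subset of the original list)
brands = ['Amazon', 'Big Bazaar', 'Brand Factory', 'Myntra']

# (1, 2) emits both single words and 2-grams, so both kinds of entries can match
vect = CountVectorizer(vocabulary=brands, ngram_range=(1, 2), lowercase=False)

# A fixed vocabulary means transform() can be called without fitting first
counts = vect.transform(['Big Bazaar', 'Amazon Myntra']).toarray()

# A list vocabulary keeps its order, so it doubles as the column labels
token_df = pd.DataFrame(counts, columns=brands)
print(token_df)
```

Here the first row should count only 'Big Bazaar', and the second row should count 'Amazon' and 'Myntra' once each.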

Upvotes: 1
