VMS

Reputation: 403

sklearn CountVectorizer returning all zeros - string conversion issue?

I am trying to use sklearn's CountVectorizer with a given vocabulary. My vocabulary is:

['humanitarian crisis', 'vacations for the anti-cruise crowd', 'school textbook', "b'cruise vacations for the anti-cruise", 'budget deal', "b'public school", 'u.n. announces', 'wrong petrol', 'vacations for the anti-cruise', "b'cruise vacations for the anti-cruise crowd"]

The input to vectorize comes from a pandas DataFrame, which I read from a CSV with pd.read_csv and encoding='utf8':

29371            b'9 quirky and brilliant paris boutiques'
20525    b'public school textbook filled with muslim bi...
2871     b'congress focuses on averting shutdown, but t...
29902    b'yarmouk siege: u.n. announces trip to syria ...
45596    b'fracking protesters arrested for gluing them...
6266         b'cruise vacations for the anti-cruise crowd'

After a call to CountVectorizer(vocabulary=vocabulary).fit_transform(), I get a matrix of all zeros:

(<6x10 sparse matrix of type '<type 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>, <class 'scipy.sparse.csr.csr_matrix'>)

Is this a problem with the string types, or with how I'm calling CountVectorizer? I'm not sure how else to convert the strings; I've tried several combinations of encode and decode in Python 2.7 and pandas. Any suggestions would be appreciated.

Upvotes: 2

Views: 662

Answers (1)

pearsonpark

Reputation: 53

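Pass ngram_range=(min_word_count, max_word_count) when you construct CountVectorizer, where the two values are the smallest and largest number of words in any vocabulary entry. The default is ngram_range=(1, 1), which only produces single-word tokens, so multi-word vocabulary entries such as 'school textbook' can never match and every count comes out as zero.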
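A minimal sketch of the difference, assuming a shortened three-entry vocabulary and a few sample headlines (the `vocabulary` and `docs` values here are illustrative, not the asker's full data). It also prints what the default analyzer does with one of the original strings, since the default tokenizer splits on punctuation and drops one-character tokens, which may matter for entries such as "b'cruise ..." or "anti-cruise":

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative data: two-word phrases only, so ngram_range=(2, 2) is enough.
vocabulary = ["school textbook", "budget deal", "cruise vacations"]
docs = [
    "public school textbook",                      # hypothetical short headline
    "congress reaches budget deal",                # hypothetical short headline
    "b'cruise vacations for the anti-cruise crowd'",
    "b'9 quirky and brilliant paris boutiques'",
]

# Default ngram_range=(1, 1): only single words are generated,
# so none of the two-word vocabulary entries can match.
default_counts = CountVectorizer(vocabulary=vocabulary).fit_transform(docs)

# With bigrams enabled, the two-word entries are found.
bigram_counts = CountVectorizer(
    vocabulary=vocabulary, ngram_range=(2, 2)
).fit_transform(docs)

# Inspect how the default analyzer tokenizes one of the raw strings.
tokens = CountVectorizer().build_analyzer()(docs[2])

print(default_counts.nnz)       # number of non-zero counts with the default
print(bigram_counts.toarray())  # one row per document, one column per phrase
print(tokens)
```

If the tokens printed at the end do not contain a vocabulary entry's words exactly as written (for example because of a leading b' or a hyphen), that entry will still count as zero regardless of ngram_range, so it may be worth cleaning the vocabulary the same way the text is tokenized.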

Upvotes: 2
