MEric

Reputation: 966

Countvectorizer Initialization

I recently initialized a CountVectorizer as follows:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

vectorizer = CountVectorizer(input=u'content',
             encoding=u'utf-8',
             charset=None,
             decode_error=u'strict',
             charset_error=None,
             strip_accents=None,
             lowercase=True,
             preprocessor=None,
             tokenizer=None,
             stop_words=None,
             ngram_range=(1, 1),
             analyzer=u'word',
             max_df=1.0,
             min_df=0,
             token_pattern=u'(?u)\b\w\w+\b',
             max_features=None,
             vocabulary=None,
             binary=False,
             dtype=np.int64)

Afterwards, I made the call:

documents = ['1fbe01fe', '1fbe01ff']
x = vectorizer.fit_transform(documents)

which generated an error:

ValueError: empty vocabulary; perhaps the documents only contain stop words

However, when I remove the line token_pattern=u'(?u)\b\w\w+\b' from the initialization, the same two lines run without error. This confuses me because, as far as I know, the default value of token_pattern in CountVectorizer is this very pattern. If I don't provide it explicitly, shouldn't the same default be filled in automatically and produce the same error?

Any help would be appreciated!

Upvotes: 1

Views: 2370

Answers (1)

JAB

Reputation: 12799

The documented regex for token_pattern is shown unescaped. In the plain (non-raw) string u'(?u)\b\w\w+\b', Python interprets each \b as a backspace character rather than a regex word boundary, so the pattern never matches anything and the vocabulary comes out empty. If you initialize CountVectorizer with the defaults and then inspect get_params, you can see the default for token_pattern is actually the escaped string u'(?u)\\b\\w\\w+\\b' (equivalent to the raw string r'(?u)\b\w\w+\b').
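You can see the difference directly with the re module; here is a minimal check, using a sample string taken from your documents:

import re

plain = u'(?u)\b\w\w+\b'        # Python parses each \b as the backspace character \x08
escaped = u'(?u)\\b\\w\\w+\\b'  # \\b reaches the regex engine as a word boundary

print(re.findall(plain, '1fbe01fe'))    # [] -- the backspace never matches
print(re.findall(escaped, '1fbe01fe'))  # ['1fbe01fe']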

This is why it works when you leave token_pattern at its default. To check what the default actually is, run the code below:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.get_params  # no parentheses: the repr of the bound method shows the estimator's parameters

returns:

<bound method CountVectorizer.get_params of CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)>
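
So passing the escaped default, or more simply a raw string, gives you the default behavior back. A minimal sketch using the documents from your question:

from sklearn.feature_extraction.text import CountVectorizer

documents = ['1fbe01fe', '1fbe01ff']

# raw string: \b stays a regex word boundary, matching the default pattern
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
x = vectorizer.fit_transform(documents)
print(vectorizer.vocabulary_)  # -> {u'1fbe01fe': 0, u'1fbe01ff': 1}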

Upvotes: 2
