TfidfVectorizer NotFittedError

Question

I am using sklearn Pipeline and FeatureUnion to create features from text files and I want to print out the feature names.

First, I collect all transformations into a list.

In [225]:components
Out[225]: 
[TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
         dtype=, encoding=u'utf-8', input=u'content',
         lowercase=True, max_df=0.85, max_features=None, min_df=6,
         ngram_range=(1, 1), norm='l1', preprocessor=None, smooth_idf=True,
         stop_words='english', strip_accents=None, sublinear_tf=True,
         token_pattern=u'(?u)[#a-zA-Z0-9/\-]{2,}',
         tokenizer=StemmingTokenizer(proc_type=stem, token_pattern=(?u)[a-zA-Z0-9/\-]{2,}),
         use_idf=True, vocabulary=None),
 TruncatedSVD(algorithm='randomized', n_components=150, n_iter=5,
        random_state=None, tol=0.0),
 TextStatsFeatures(),
 DictVectorizer(dtype=, separator='=', sort=True,
         sparse=True),
 DictVectorizer(dtype=, separator='=', sort=True,
         sparse=True),
 TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
         dtype=, encoding=u'utf-8', input=u'content',
         lowercase=True, max_df=0.85, max_features=None, min_df=6,
         ngram_range=(1, 2), norm='l1', preprocessor=None, smooth_idf=True,
         stop_words='english', strip_accents=None, sublinear_tf=True,
         token_pattern=u'(?u)[a-zA-Z0-9/\-]{2,}',
         tokenizer=StemmingTokenizer(proc_type=stem, token_pattern=(?u)[a-zA-Z0-9/\-]{2,}),
         use_idf=True, vocabulary=None)]

For example, the first component is a TfidfVectorizer() object.

components[0]
Out[226]: 
TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.85, max_features=None, min_df=6,
        ngram_range=(1, 1), norm='l1', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=True,
        token_pattern=u'(?u)[#a-zA-Z0-9/\-]{2,}',
        tokenizer=StemmingTokenizer(proc_type=stem, token_pattern=(?u)[a-zA-Z0-9/\-]{2,}),
        use_idf=True, vocabulary=None)

type(components[0])
Out[227]: sklearn.feature_extraction.text.TfidfVectorizer

But when I try to call the TfidfVectorizer method get_feature_names, it raises a NotFittedError:

components[0].get_feature_names()
Traceback (most recent call last):

  File "", line 1, in 
    components[0].get_feature_names()

  File "C:\Users\fheng\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 903, in get_feature_names
    self._check_vocabulary()

  File "C:\Users\fheng\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 275, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),

  File "C:\Users\fheng\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 678, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})

**NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.**

Vivek Kumar · Accepted Answer

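Have you used this list in a Pipeline or FeatureUnion? And have you called the fit() method on it?

This error means you have not called fit() (i.e. trained the models) and are directly trying to access the fitted values. The vocabulary that get_feature_names() reports is only learned during fit(), so an unfitted TfidfVectorizer has nothing to return.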
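To illustrate the point, here is a minimal sketch rather than the asker's actual setup. The toy corpus, the step names "tfidf" and "counts", and the helper feature_names() are made up for this example. It assumes a scikit-learn version where NotFittedError can be imported from sklearn.exceptions, and the helper falls back to get_feature_names() when get_feature_names_out() is not available.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Toy corpus, made up for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]


def feature_names(estimator):
    """Hypothetical helper: use whichever method this scikit-learn provides."""
    if hasattr(estimator, "get_feature_names_out"):
        return list(estimator.get_feature_names_out())
    return list(estimator.get_feature_names())


vectorizer = TfidfVectorizer()

# Before fit(): no vocabulary has been learned, so the lookup raises.
try:
    feature_names(vectorizer)
    raised_before_fit = False
except NotFittedError:
    raised_before_fit = True

# After fit(): the vocabulary exists and the names are available.
vectorizer.fit(docs)
names = feature_names(vectorizer)

# Inside a FeatureUnion, fit the union first, then read the fitted
# component back from union.transformer_list.
union = FeatureUnion([("tfidf", TfidfVectorizer()), ("counts", CountVectorizer())])
union.fit(docs)
fitted_tfidf = dict(union.transformer_list)["tfidf"]
union_names = feature_names(fitted_tfidf)

print(raised_before_fit)
print(names)
print(union_names == names)
```

Reading the component back from the fitted union, instead of from the original components list, avoids depending on whether a given scikit-learn version fits the objects in place.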
