Reputation: 249
I am doing some work with document classification, using scikit-learn's HashingVectorizer followed by a TfidfTransformer. If the TfidfTransformer parameters are left at their defaults, I have no problems. However, if I set sublinear_tf=True, the following error is raised:
ValueError Traceback (most recent call last)
<ipython-input-16-137f187e99d8> in <module>()
----> 5 tfidf.transform(test)
D:\Users\DB\Anaconda\lib\site-packages\sklearn\feature_extraction\text.pyc in transform(self, X, copy)
1020
1021 if self.norm:
-> 1022 X = normalize(X, norm=self.norm, copy=False)
1023
1024 return X
D:\Users\DB\Anaconda\lib\site-packages\sklearn\preprocessing\data.pyc in normalize(X, norm, axis, copy)
533 raise ValueError("'%d' is not a supported axis" % axis)
534
--> 535 X = check_arrays(X, sparse_format=sparse_format, copy=copy)[0]
536 warn_if_not_float(X, 'The normalize function')
537 if axis == 0:
D:\Users\DB\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in check_arrays(*arrays, **options)
272 if not allow_nans:
273 if hasattr(array, 'data'):
--> 274 _assert_all_finite(array.data)
275 else:
276 _assert_all_finite(array.values())
D:\Users\DB\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in _assert_all_finite(X)
41 and not np.isfinite(X).all()):
42 raise ValueError("Input contains NaN, infinity"
---> 43 " or a value too large for %r." % X.dtype)
44
45
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I found a minimal sample of texts that causes the error and ran some diagnostics:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
import numpy as np

hv_stops = HashingVectorizer(ngram_range=(1, 2), preprocessor=neg_preprocess, stop_words='english')
tfidf = TfidfTransformer(sublinear_tf=True).fit(hv_stops.transform(X))

test = hv_stops.transform(X[4:6])
# The dense view of the matrix looks perfectly finite:
print np.any(np.isnan(test.todense()))    # False
print np.any(np.isinf(test.todense()))    # False
print np.all(np.isfinite(test.todense())) # True

tfidf.transform(test)  # Raises the ValueError anyway
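Since todense() cannot distinguish explicitly stored zeros from implicit ones, one further check on the raw CSR arrays seems worthwhile (sublinear_tf replaces each stored tf with 1 + log(tf), so any stored value <= 0 goes non-finite):

print np.isfinite(test.data).all()  # True -- the stored values themselves are fine
print np.any(test.data <= 0)        # True -- but log() of these cannot stay finite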
Any thoughts on what is causing the error? If any more information is needed, please let me know. Thanks in advance!
Edit:
This single text item is causing the error for me:
hv_stops = HashingVectorizer(ngram_range=(1,3), stop_words='english', non_negative=True)
item = u'b number b number b number conclusion no product_neg was_neg returned_neg for_neg evaluation_neg review of the medd history records did not find_neg any_neg deviations_neg or_neg anomalies_neg it is not suspected_neg that_neg the_neg product_neg failed_neg to_neg meet_neg specifications_neg the investigation could not verify_neg or_neg identify_neg any_neg evidence_neg of_neg a_neg medd_neg deficiency_neg causing_neg or_neg contributing_neg to_neg the_neg reported_neg problem_neg based on the investigation the need for corrective action is not indicated_neg should additional information be received that changes this conclusion an amended medd report will be filed zimmer considers the investigation closed this mdr is being submitted late as this issue was identified during a retrospective review of complaint files '
li = [item]
fail = hv_stops.transform(li)
TfidfTransformer(sublinear_tf=True).fit_transform(fail)  # raises the same ValueError
Upvotes: 3
Views: 1650
Reputation: 363627
I've found the cause. TfidfTransformer assumes that the sparse matrix it gets is canonical, i.e. that it contains no explicitly stored zeros in its data member. However, HashingVectorizer produces a sparse matrix that does contain a stored zero. The sublinear log-transform turns that zero into -inf, and that in turn causes normalization to fail because the matrix has infinite norm.
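Here is a minimal reconstruction of the failure; the matrix literal is synthetic, standing in for what HashingVectorizer emits:

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

# A 1x2 CSR matrix whose data array contains an explicitly stored zero,
# as can happen when signed hash counts cancel out in HashingVectorizer.
X = sp.csr_matrix((np.array([0., 1.]), np.array([0, 1]), np.array([0, 2])), shape=(1, 2))
print X.data  # [ 0.  1.] -- the zero is stored, not implicit

# sublinear_tf applies 1 + log(tf) to X.data; log(0) = -inf, and the
# subsequent normalize() step raises the ValueError shown above.
TfidfTransformer(sublinear_tf=True).fit_transform(X)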
This is a bug in scikit-learn; I made a report of it, but I'm not yet sure what the fix is.
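In the meantime, a workaround that should avoid the error is to canonicalize the matrix yourself before the tf-idf step; scipy's sparse CSR matrices have an in-place eliminate_zeros() method for exactly this:

fail = hv_stops.transform(li)
fail.eliminate_zeros()  # drop the explicitly stored zeros in place
TfidfTransformer(sublinear_tf=True).fit_transform(fail)  # no more -inf

eliminate_zeros() only changes the sparse storage, not the values the matrix represents, so the tf-idf results themselves are unaffected.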
Upvotes: 4