ScientiaEtVeritas

Reputation: 5278

predict_proba always returns a probability of 100 percent (sklearn)

I'm using Python's sklearn to classify texts.

I call predict_proba and the output looks like this:

[[  6.74918834e-53   1.59981248e-51   2.74934762e-26   1.24948745e-43
    2.93801753e-48   3.43788315e-18   1.00000000e+00   2.96818867e-20]]
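
For reference, the columns of this array line up with the classifier's classes_ attribute. This is a minimal sketch of how I read off the per-class probabilities, using the fitted pipeline shown further down (`doc` is just a placeholder for one of my articles):

    # Map each predict_proba column to its class label.
    proba = pipeline.predict_proba([doc])[0]
    for label, p in zip(pipeline.classes_, proba):
        print(f"{label}: {p:.2e}")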

Even when I feed in ambiguous data, the output always looks like this. It doesn't seem plausible that the classifier is always 100 percent sure, so what's the problem here?

At the moment I'm using the MultinomialNB classifier for text classification. I train my model on newspaper articles with classes like sports, economy, etc. I have 175 training examples, distributed like this:

    {'business': 27,
     'economy': 20,
     'lifestyle': 22,
     'opinion': 11,
     'politics': 30,
     'science': 21,
     'sport': 21,
     'tech': 23}

My pipeline looks like this; my features are mainly bag-of-words, plus some linguistic key figures like text length.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.preprocessing import StandardScaler

    cv = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=1000)
    tt = TfidfTransformer()
    lv = LinguisticVectorizer()  # custom class
    clf = MultinomialNB()

    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('ngram_tf_idf', Pipeline([
                ('counts', cv),
                ('tf_idf', tt),
            ])),
            ('linguistic', lv),
        ])),
        ('scaler', StandardScaler(with_mean=False)),
        ('classifier', clf),
    ])

If you want to take a look at my training examples, I've uploaded them here: wetransfer.com

UPDATE: Maybe it is worth mentioning that the current setup scores 0.67 on the test samples. Before I added the StandardScaler, the probabilities were distributed more realistically (i.e. not always 100 percent), but it scored only 0.2.
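
For what it's worth, I can reproduce the saturation with a toy sketch (made-up data, nothing to do with my real corpus): simply blowing up the feature magnitudes, as a scaler can do, pushes MultinomialNB's predict_proba toward hard 0/1 answers.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    rng = np.random.default_rng(0)
    X = rng.random((20, 5))       # toy non-negative features
    y = rng.integers(0, 2, 20)    # two toy classes

    # Same data, scaled up by a constant factor: probabilities saturate.
    for factor in (1, 100):
        nb = MultinomialNB().fit(X * factor, y)
        print(factor, nb.predict_proba(X[:1] * factor))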

UPDATE: After adding a MaxAbsScaler to the pipeline, it seems to work correctly. Can someone explain this weird behaviour?
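
In case it helps, the change looks like this (a sketch; it assumes the MaxAbsScaler replaces the StandardScaler step):

    from sklearn.preprocessing import MaxAbsScaler

    # MaxAbsScaler divides each feature by its maximum absolute value,
    # so non-negative tf-idf features stay non-negative and bounded in [0, 1].
    pipeline.set_params(scaler=MaxAbsScaler())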

Upvotes: 2

Views: 1399

Answers (1)

lejlot

Reputation: 66775

Given that this is Naive Bayes, this means that at least one of the following holds:

  • you have a bug in your data processing routine; maybe you transform each whole document into a single token instead of actually splitting it into words? Check every single step to make sure your documents are really encoded at the word level (see the sketch after this list).
  • your data is "corrupted" (there are words that uniquely identify each class); for example, the newsgroups dataset originally included header information in which the class name was literally specified, so each document about sport contained "group:sport@..." etc.
  • you have a huge class imbalance, and your model is simply predicting the majority class all the time.
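
A quick way to check the first two points (a sketch against the pipeline from the question; depending on your sklearn version, get_feature_names_out() may be called get_feature_names()):

    import numpy as np

    # Pull the fitted pieces out of the question's pipeline.
    cv = pipeline.named_steps['features'].transformer_list[0][1].named_steps['counts']
    nb = pipeline.named_steps['classifier']

    # 1) Sanity-check the tokenization: the vocabulary should be ordinary
    #    words, not whole documents glued together.
    print(len(cv.vocabulary_))
    print(sorted(cv.vocabulary_)[:10])

    # 2) Look for give-away features: the highest-weight words per class.
    #    Leaked headers or literal class labels tend to show up at the top.
    words = cv.get_feature_names_out()   # first block of the FeatureUnion output
    for label, logp in zip(nb.classes_, nb.feature_log_prob_):
        top = np.argsort(logp[:len(words)])[::-1][:10]
        print(label, [words[i] for i in top])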

Upvotes: 2
