ScientiaEtVeritas

Reputation: 5278

predict_proba always returns a probability of 100 percent (sklearn)

I'm using Python's sklearn to classify texts.

I call predict_proba and the output looks like this:

[[  6.74918834e-53   1.59981248e-51   2.74934762e-26   1.24948745e-43
    2.93801753e-48   3.43788315e-18   1.00000000e+00   2.96818867e-20]]
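
For reference, the columns of this array line up with the classifier's classes_ attribute. This is a minimal sketch of how I read off the per-class probabilities, using the fitted pipeline shown further down (`doc` is just a placeholder for one of my articles):

    # Map each predict_proba column to its class label.
    proba = pipeline.predict_proba([doc])[0]
    for label, p in zip(pipeline.classes_, proba):
        print(f"{label}: {p:.2e}")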

Even when I feed in ambiguous data, the output always looks like this. It doesn't seem plausible that the classifier is always 100 percent sure, so what's the problem here?

At the moment I'm using the MultinomialNB classifier for text classification. I train my model on newspaper articles with classes like sports, economy, etc. I have 175 training examples, distributed like this:

    {'business': 27,
     'economy': 20,
     'lifestyle': 22,
     'opinion': 11,
     'politics': 30,
     'science': 21,
     'sport': 21,
     'tech': 23}

My pipeline looks like this; my features are mainly bag-of-words, plus some linguistic key figures like text length.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.preprocessing import StandardScaler

    cv = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=1000)
    tt = TfidfTransformer()
    lv = LinguisticVectorizer()  # custom class
    clf = MultinomialNB()

    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('ngram_tf_idf', Pipeline([
                ('counts', cv),
                ('tf_idf', tt),
            ])),
            ('linguistic', lv),
        ])),
        ('scaler', StandardScaler(with_mean=False)),
        ('classifier', clf),
    ])

If you want to take a look at my training examples, I've uploaded them here: wetransfer.com

UPDATE: Maybe it is worth mentioning that the current setup scores 0.67 on the test samples. Before I added the StandardScaler, the probabilities were distributed more realistically (i.e. not always 100 percent), but it scored only 0.2.
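
For what it's worth, I can reproduce the saturation with a toy sketch (made-up data, nothing to do with my real corpus): simply blowing up the feature magnitudes, as a scaler can do, pushes MultinomialNB's predict_proba toward hard 0/1 answers.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    rng = np.random.default_rng(0)
    X = rng.random((20, 5))       # toy non-negative features
    y = rng.integers(0, 2, 20)    # two toy classes

    # Same data, scaled up by a constant factor: probabilities saturate.
    for factor in (1, 100):
        nb = MultinomialNB().fit(X * factor, y)
        print(factor, nb.predict_proba(X[:1] * factor))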

UPDATE: After adding a MaxAbsScaler to the pipeline, it seems to work correctly. Can someone explain this weird behaviour?
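
In case it helps, the change looks like this (a sketch; it assumes the MaxAbsScaler replaces the StandardScaler step):

    from sklearn.preprocessing import MaxAbsScaler

    # MaxAbsScaler divides each feature by its maximum absolute value,
    # so non-negative tf-idf features stay non-negative and bounded in [0, 1].
    pipeline.set_params(scaler=MaxAbsScaler())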

Upvotes: 2

Views: 1399

Answers (1)

lejlot

Reputation: 66775

Given that this is Naive Bayes, this means that at least one of the following holds:

  • you have a bug in your data processing routine; maybe you transform each whole document into a single token instead of actually splitting it into words? Check every single step to make sure your documents are really encoded at the word level (see the sketch after this list).
  • your data is "corrupted" (there are words that uniquely identify each class); for example, the newsgroups dataset originally included header information in which the class name was literally specified, so each document about sport contained "group:sport@..." etc.
  • you have a huge class imbalance, and your model is simply predicting the majority class all the time.
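
A quick way to check the first two points (a sketch against the pipeline from the question; depending on your sklearn version, get_feature_names_out() may be called get_feature_names()):

    import numpy as np

    # Pull the fitted pieces out of the question's pipeline.
    cv = pipeline.named_steps['features'].transformer_list[0][1].named_steps['counts']
    nb = pipeline.named_steps['classifier']

    # 1) Sanity-check the tokenization: the vocabulary should be ordinary
    #    words, not whole documents glued together.
    print(len(cv.vocabulary_))
    print(sorted(cv.vocabulary_)[:10])

    # 2) Look for give-away features: the highest-weight words per class.
    #    Leaked headers or literal class labels tend to show up at the top.
    words = cv.get_feature_names_out()   # first block of the FeatureUnion output
    for label, logp in zip(nb.classes_, nb.feature_log_prob_):
        top = np.argsort(logp[:len(words)])[::-1][:10]
        print(label, [words[i] for i in top])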

Upvotes: 2
