Reputation: 5278
I'm using Python's sklearn to classify texts.
When I call predict_proba, the output looks like this:
[[ 6.74918834e-53 1.59981248e-51 2.74934762e-26 1.24948745e-43
2.93801753e-48 3.43788315e-18 1.00000000e+00 2.96818867e-20]]
Even when I feed in ambiguous data, the output always looks like this. It doesn't seem plausible that the classifier is always 100 percent sure, so what's the problem here?
At the moment I'm using the MultinomialNB classifier for text classification. I train the model on newspaper articles with classes like sports, economy, etc. There are 175 training examples, distributed like this:
{'business': 27,
'economy': 20,
'lifestyle': 22,
'opinion': 11,
'politics': 30,
'science': 21,
'sport': 21,
'tech': 23}
My pipeline looks like this; the features are mainly bag-of-words plus some linguistic key figures such as text length (a stripped-down sketch of the custom vectorizer follows the pipeline).
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler

cv = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=1000)
tt = TfidfTransformer()
lv = LinguisticVectorizer()  # custom class, sketched below
clf = MultinomialNB()

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngram_tf_idf', Pipeline([
            ('counts', cv),
            ('tf_idf', tt),
        ])),
        ('linguistic', lv),
    ])),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', clf)
])
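In case it's relevant: LinguisticVectorizer is just a transformer that returns numeric features per document. Here is a stripped-down sketch (the real class computes a few more key figures than just the text length):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LinguisticVectorizer(BaseEstimator, TransformerMixin):
    """Toy version: emits one numeric feature (text length) per document."""

    def fit(self, raw_documents, y=None):
        return self  # stateless, nothing to learn

    def transform(self, raw_documents):
        # shape (n_documents, n_features) so FeatureUnion can stack it
        # next to the tf-idf matrix
        return np.array([[len(doc)] for doc in raw_documents], dtype=float)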
If you want to take a look at my training examples, I've uploaded them here: wetransfer.com
UPDATE: Maybe it is worth mentioning that the current setup scores 0.67 on the test samples. Before I added the StandardScaler, the probabilities were distributed more realistically (i.e. not always 100 percent), but the score was only 0.2.
UPDATE: After adding a MaxAbsScaler to the pipeline, it seems to work correctly. Can someone explain this weird behaviour?
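For completeness, this is roughly what I mean (in this sketch MaxAbsScaler simply takes the place of the StandardScaler step; the rest of the pipeline is unchanged):

from sklearn.preprocessing import MaxAbsScaler

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngram_tf_idf', Pipeline([
            ('counts', cv),
            ('tf_idf', tt),
        ])),
        ('linguistic', lv),
    ])),
    # scales each feature by its maximum absolute value, so the sparse
    # tf-idf columns stay within [0, 1] instead of being inflated
    ('scaler', MaxAbsScaler()),
    ('classifier', clf)
])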
Upvotes: 2
Views: 1399
Reputation: 66775
This means, especially given that it is Naive Bayes, that at least one of the following holds:
Upvotes: 2