Reputation: 361
I'm using the TNTSearch text classification module (https://github.com/teamtnt/tntsearch), and it works well. The problem is that I don't know how to interpret the results, specifically the likelihood of a correct match. I have read that it uses a Naive Bayes classifier, but I can't find out what kind of probability distribution the result represents. I have a small test dataset of about 50 values (5 categories of 10 values each), and the guesses are fairly accurate.
However, the likelihood number that this tool provides is a negative number somewhere in the range of about -15 to -25.
The question is: what value should be interpreted as not credible? Say the tool is less than 33% sure. What likelihood value would correspond to that?
Upvotes: 3
Views: 264
Reputation: 361
I got in touch with the TNTSearch developers. The classifier doesn't actually return a probability but a "highest score", and only for the best match.
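For context, here is my reading of what that score is, based on the predict method itself (log of the prior plus a sum of per-word logs). The symbols are my own notation, not anything from the library: s_c is the score for type c, and w_1 ... w_n are the stemmed words of the statement.

```latex
s_c = \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c),
\qquad
P(c \mid w_1, \dots, w_n) = \frac{e^{s_c}}{\sum_{k} e^{s_k}}
```

The second equality holds under the Naive Bayes independence assumption. Each term of s_c is the log of a value no greater than 1, so the score is negative and becomes more negative as the statement gets longer. That is why a fixed cut-off on the raw score (say, -20) does not correspond to a fixed confidence such as 33%; the scores of all types have to be normalized against each other first.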
As advised, I have made some changes to the code.
In the class TeamTNT\TNTSearch\Classifier\TNTClassifier I changed bits in the predict method (softmax function inspired from here):
public function predict($statement)
{
    $words = $this->tokenizer->tokenize($statement);

    $best_likelihoods = []; // log-likelihood score of every type
    $best_likelihood  = -INF;
    $best_type        = '';

    foreach ($this->types as $type) {
        $likelihood = log($this->pTotal($type)); // log P(Type)

        $p = 0;
        foreach ($words as $word) {
            $word = $this->stemmer->stem($word);
            $p += log($this->p($word, $type)); // log P(word | Type)
        }
        $likelihood += $p; // log P(words, Type)

        // Record the score of every type, not only of the running best,
        // so that softmax normalizes over all types.
        $best_likelihoods[$type] = $likelihood;

        if ($likelihood > $best_likelihood) {
            $best_likelihood = $likelihood;
            $best_type       = $type;
        }
    }

    return [
        'likelihood'  => $best_likelihood,
        'likelihoods' => $best_likelihoods,
        'probability' => $this->softmax($best_likelihoods),
        'label'       => $best_type
    ];
}
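The probability of the best match (a value between 0 and 1, assuming softmax normalizes the scores so that they sum to 1) can then be found in $guess['probability'][$guess['label']].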
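The softmax method itself is not shown above, so here is a minimal sketch of what it could look like. It is written as a standalone function for illustration; the name, signature, and the example scores are my own assumptions, not part of TNTSearch. It subtracts the largest score before exponentiating, which keeps exp() from underflowing to zero when the log-likelihoods are very negative (long statements).

```php
<?php
/**
 * Hypothetical helper, not part of TNTSearch: turn log-likelihood scores
 * into probabilities that sum to 1.
 * Assumes at least one score is finite (not -INF).
 *
 * @param array $scores label => log-likelihood
 * @return array label => probability
 */
function softmax(array $scores): array
{
    if ($scores === []) {
        return [];
    }

    // Shift by the maximum so the largest exponent is exp(0) = 1.
    $max = max($scores);

    $exps = [];
    foreach ($scores as $label => $score) {
        $exps[$label] = exp($score - $max);
    }

    // Normalize so the values sum to 1.
    $sum = array_sum($exps);
    foreach ($exps as $label => $value) {
        $exps[$label] = $value / $sum;
    }

    return $exps;
}

// Made-up scores in the range mentioned in the question.
$probabilities = softmax(['sport' => -15.0, 'politics' => -17.0, 'tech' => -25.0]);
$label = array_keys($probabilities, max($probabilities))[0];

// Treat a guess below the 33% threshold from the question as not credible.
$credible = $probabilities[$label] >= 0.33;
printf("%s: %.2f (%s)\n", $label, $probabilities[$label], $credible ? 'credible' : 'not credible');
```

With these example scores the best label comes out at roughly 0.88, because its score is 2 higher than the runner-up. What matters for the probability is the gap between scores, not their absolute size.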
Upvotes: 1