Reputation: 115
I'm currently using Naive Bayes to classify a bunch of texts. I have multiple categories. Right now I just output the posterior probability and the category, but what I would like to do is rank the categories based on the posterior probabilities and use the 2nd, 3rd place categories as "back up" categories.
Here's an example:
df = pandas.DataFrame({
    'text': pandas.Categorical(["I have wings", "Metal wings", "Feathers", "Airport"]),
    'true_cat': pandas.Categorical(["bird", "plane", "bird", "plane"]),
})
text            true_cat
------------------------
I have wings    bird
Metal wings     plane
Feathers        bird
Airport         plane
What I'm doing:
new_cat = classifier.classify(features(text))
prob_cat = classifier.prob_classify(features(text))
Eventual Output:
new_cat   prob_cat   text           true_cat
bird      0.67       I have wings   bird
bird      0.6        Feathers       bird
bird      0.51       Metal wings    plane
plane     0.8        Airport        plane
I have found a couple examples using classify_many and prob_classify_many but since I'm new to Python I'm having trouble translating it to my problem. I haven't seen it used with pandas anywhere.
I want it to look like this:
df_new = pandas.DataFrame({
    'text': pandas.Categorical(["I have wings", "Metal wings", "Feathers", "Airport"]),
    'true_cat': pandas.Categorical(["bird", "plane", "bird", "plane"]),
    'new_cat1': pandas.Categorical(["bird", "bird", "bird", "plane"]),
    'new_cat2': pandas.Categorical(["plane", "plane", "plane", "bird"]),
    'prob_cat1': pandas.Categorical(["0.67", "0.51", "0.6", "0.8"]),
    'prob_cat2': pandas.Categorical(["0.33", "0.49", "0.4", "0.2"]),
})
new_cat1   new_cat2   prob_cat1   prob_cat2   text           true_cat
-----------------------------------------------------------------------
bird       plane      0.67        0.33        I have wings   bird
bird       plane      0.51        0.49        Metal wings    plane
bird       plane      0.6         0.4         Feathers       bird
plane      bird       0.8         0.2         Airport        plane
Any help would be appreciated.
Upvotes: 0
Views: 592
Reputation: 50220
I'm treating your self-answer as part of your question. Presumably you got the probability of the classification "bird" like this:
prob_cat.prob("bird")
Here, prob_cat is an nltk probability distribution (ProbDist). You can get all categories in a discrete ProbDist, together with their probabilities, like this:
probs = list((x, prob_cat.prob(x)) for x in prob_cat.samples())
Since you already know the categories you trained with, you can use a predefined list instead of prob_cat.samples(). Finally, you can order them from the most to the least probable in the same expression:
mycategories = ["bird", "plane"]
probs = sorted(((x, prob_cat.prob(x)) for x in mycategories), key=lambda tup: -tup[1])
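To get from the sorted list to the new_cat1 / prob_cat1 / new_cat2 / prob_cat2 columns in the question, one option is to flatten the top-ranked pairs into a dict per text. The following is only a sketch: rank_categories, ranked_row and FakeProbDist are hypothetical names, and FakeProbDist is a stand-in that mimics the samples() and prob() methods used above so that the example runs without a trained classifier.

```python
def rank_categories(prob_dist, n=2):
    """Return the n most probable (category, probability) pairs, best first."""
    pairs = [(cat, prob_dist.prob(cat)) for cat in prob_dist.samples()]
    pairs.sort(key=lambda tup: tup[1], reverse=True)
    return pairs[:n]


def ranked_row(prob_dist, n=2):
    """Flatten the top-n pairs into new_cat1, prob_cat1, new_cat2, ... keys."""
    row = {}
    for rank, (cat, p) in enumerate(rank_categories(prob_dist, n), start=1):
        row["new_cat%d" % rank] = cat
        row["prob_cat%d" % rank] = p
    return row


class FakeProbDist:
    """Hypothetical stand-in: only mimics the samples()/prob() methods."""

    def __init__(self, probs):
        self._probs = probs

    def samples(self):
        return self._probs.keys()

    def prob(self, sample):
        return self._probs.get(sample, 0.0)


# Illustrative numbers taken from the "I have wings" row in the question.
example = ranked_row(FakeProbDist({"bird": 0.67, "plane": 0.33}))
print(example)
```

With your real classifier, something along the lines of rows = [ranked_row(classifier.prob_classify(features(t))) for t in df["text"]] followed by df.join(pandas.DataFrame(rows, index=df.index)) should attach the new columns, assuming features and classifier are the ones from your question.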
Upvotes: 1
Reputation: 115
I'm starting to get there now.
# This gives me the probability it's a bird.
prob_cat.prob("bird")
# This gives me the probability it's a plane.
prob_cat.prob("plane")
Now since I have dozens of categories I'm working on a good way to have it give me all of them without putting in all of the category names, but that should be pretty simple.
Upvotes: 0