Reputation: 711
I'm having difficulty understanding A) the output of naiveBayes and B) the predict() function for naiveBayes.
This is not my data, but here's a fun example of what I'm trying to do and the errors I am getting:
require(RTextTools)
require(e1071)   # provides naiveBayes()
require(useful)
script <- data.frame(lines=c("Rufus, Brint, and Meekus were like brothers to me. And when I say brother, I don't mean, like, an actual brother, but I mean it like the way black people use it. Which is more meaningful I think","If there is anything that this horrible tragedy can teach us, it's that a male model's life is a precious, precious commodity. Just because we have chiseled abs and stunning features, it doesn't mean that we too can't not die in a freak gasoline fight accident",
"Why do you hate models, Matilda","What is this? A center for ants? How can we be expected to teach children to learn how to read... if they can't even fit inside the building?","Look, I think I know what this is about and I'm complimented but not interested.",
"Hi Derek! My name's Little Cletus and I'm here to tell you a few things about child labor laws, ok? They're silly and outdated. Why back in the 30s, children as young as five could work as they pleased; from textile factories to iron smelts. Yippee! Hurray!","Todd, are you not aware that I get farty and bloated with a foamy latte?","Oh, I'm sorry, did my pin get in the way of your ass? Do me a favor and lose five pounds immediately or get out of my building like now!",
"It's that damn Hansel! He's so hot right now!","Obey my dog!",
"I hear words like beauty and handsomness and incredibly chiseled features and for me that's like a vanity of self absorption that I try to steer clear of.","Yeah, you're cool to hide here, but first me and him got to straighten some shit out.",
"I wasn't like every other kid, you know, who dreams about being an astronaut, I was always more interested in what bark was made out of on a tree. Richard Gere's a real hero of mine. Sting. Sting would be another person who's a hero. The music he's created over the years, I don't really listen to it, but the fact that he's making it, I respect that. I care desperately about what I do. Do I know what product I'm selling? No. Do I know what I'm doing today? No. But I'm here, and I'm gonna give it my best shot.","I totally agree with you. But how do you feel about male models?",
"So I'm rappelling down Mount Vesuvius when suddenly I slip, and I start to fall. Just falling, ahh ahh, I'll never forget the terror. When suddenly I realize Holy shit, Hansel, haven't you been smoking Peyote for six straight days, and couldn't some of this maybe be in your head?"))
people <- as.factor(c("Zoolander","Zoolander","Zoolander","Zoolander","Zoolander",
"Mugatu","Mugatu","Mugatu","Mugatu","Mugatu",
"Hansel","Hansel","Hansel","Hansel","Hansel"))
script.doc.matrix <- create_matrix(script$lines,language = "english",removeNumbers=TRUE, removeStopwords = TRUE, stemWords=FALSE)
script.matrix <- as.matrix(script.doc.matrix)
nb.script <- naiveBayes(script.matrix,people)
nb.predict <- predict(nb.script,script$lines)
nb.predict
My questions:
A) naiveBayes output:
When I run
nb.script$tables
I get tables such as this:
$young
           young
people      [,1]      [,2]
  Hansel     0.0 0.0000000
  Mugatu     0.2 0.4472136
  Zoolander  0.0 0.0000000
How am I supposed to interpret this? I thought these were supposed to be probabilities, but I don't understand what the columns [,1] and [,2] mean. Also, aren't the probabilities in these tables supposed to add up to 1.0? Why don't they? It would make sense if there were a third column. Should there be one?
Should I be using type="raw" in naiveBayes() perhaps?
B) predict() of the naiveBayes:
The output gives me Hansel as the prediction for every entry. I believe this is happening simply because it is alphabetically the first class. In other runs, where Hansel was listed 4 times, Mugatu 6 times, and Zoolander 5 times, predict() gave me Mugatu as the prediction for EVERY entry, simply because it appeared most often in the class vector.
edit: for my question... how can I get the prediction to give me an ACTUAL prediction???
Output of the prediction is as follows:
> nb.predict
 [1] Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel
[12] Hansel Hansel Hansel Hansel
Levels: Hansel Mugatu Zoolander
Here is a link to a similar question: "R: Naives Bayes classifier bases decision only on a-priori probabilities". However, the answer there isn't helping me much.
Thanks in advance!
Upvotes: 1
Views: 1391
Reputation: 3286
For the first part of your question: the columns of your matrix script.matrix are numeric, and naiveBayes interprets numeric inputs as continuous data from a Gaussian distribution. The tables you see in your output give the sample mean (column 1) and standard deviation (column 2) of each numeric variable within each class of the factor. They are not probabilities, which is why the rows do not sum to 1.
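To see where those two numbers come from, here is a minimal base-R sketch. It assumes a hypothetical indicator vector young that is 1 only for the sixth line (the one Mugatu line containing "young"), and it rebuilds the per-class mean and standard deviation by hand rather than calling naiveBayes:

```r
# Class labels in the same order as in the question
people <- factor(rep(c("Zoolander", "Mugatu", "Hansel"), each = 5))

# Assumed indicator column: "young" occurs only in line 6 (a Mugatu line)
young <- c(rep(0, 5), 1, rep(0, 9))

# Per-class sample mean and standard deviation of the numeric column
stats <- t(sapply(split(young, people),
                  function(x) c(mean = mean(x), sd = sd(x))))
print(stats)
```

The Mugatu row should come out as 0.2 and 0.4472136 (that is, sqrt(0.2)), matching the [,1] and [,2] columns in the question.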
What you probably want is for naiveBayes to treat your input variables as indicators (categorical present/absent values) rather than as continuous measurements. A simple way to do that is to convert the entire script.matrix to a character matrix:
# Convert columns to characters
script.matrix <- apply(as.matrix(script.doc.matrix),2,as.character)
With this change:
> nb.script <- naiveBayes(script.matrix, people)  # refit on the character matrix
> nb.script$tables$young
           young
people        0   1
  Hansel    1.0 0.0
  Mugatu    0.8 0.2
  Zoolander 1.0 0.0
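Each row of this table is now a conditional distribution that sums to 1. As a rough check, the same numbers can be reproduced in base R as row-wise relative frequencies. The vector young below is a hypothetical stand-in for the "young" column, assumed to be "1" only for line 6, and the sketch assumes no Laplace smoothing:

```r
# Class labels in the same order as in the question
people <- factor(rep(c("Zoolander", "Mugatu", "Hansel"), each = 5))

# Assumed character indicator: "young" occurs only in line 6 (a Mugatu line)
young <- rep("0", 15)
young[6] <- "1"

# Relative frequency of each indicator value within each class
cond <- prop.table(table(people, young), margin = 1)
print(cond)
```

Here one of the five Mugatu lines contains the word, so P(young = 1 | Mugatu) is 1/5 = 0.2.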
To see the predicted classes:
> nb.predict <- predict(nb.script, script.matrix)
> nb.predict
[1] Zoolander Zoolander Zoolander Zoolander Zoolander Mugatu Mugatu
[8] Mugatu Mugatu Mugatu Hansel Hansel Hansel Hansel
[15] Hansel
Levels: Hansel Mugatu Zoolander
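Note that the call above passes script.matrix, not script$lines. A plausible explanation for the all-Hansel output in the question is that predict() finds no columns in script$lines matching the training features and so falls back on the class priors alone. That is my reading of the behaviour, not something verified against the e1071 source here. Under that assumption, this base-R sketch reproduces both symptoms described in the question:

```r
# Balanced classes, as in the example: every prior is 1/3
people <- factor(rep(c("Zoolander", "Mugatu", "Hansel"), each = 5))
prior_equal <- prop.table(table(people))
# which.max() returns the first maximum, i.e. the alphabetically first level
pick_equal <- names(which.max(prior_equal))

# Unbalanced classes (4 Hansel, 6 Mugatu, 5 Zoolander), as in the other case
people2 <- factor(rep(c("Hansel", "Mugatu", "Zoolander"), times = c(4, 6, 5)))
prior_unequal <- prop.table(table(people2))
pick_unequal <- names(which.max(prior_unequal))

print(c(equal = pick_equal, unequal = pick_unequal))
```

With equal priors the tie goes to Hansel, and with unequal priors the most frequent class (Mugatu) wins, regardless of the text being classified.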
To see the raw probabilities from the naiveBayes fit:
predict(nb.script, script.matrix, type='raw')
Upvotes: 3