Reputation: 1076
the following code should be reproducible
someFrameA = data.frame(label="A", amount=rnorm(10000, 100, 20))
someFrameB = data.frame(label="B", amount=rnorm(1000, 50000, 20))
wholeFrame = rbind(someFrameA, someFrameB)
fit <- e1071::naiveBayes(label ~ amount, wholeFrame)
wholeFrame$predicted = predict(fit, wholeFrame)
nrow(subset(wholeFrame, predicted != label))
In my case, this gave 243 misclassifications.
Note these two rows: (row num, label, amount, prediction)
10252 B 50024.81895 A
2955 A 100.55977 A
10678 B 50010.26213 B
While the input is only different by 12.6, the classification changes. It's curious that the posterior probabilities for rows like this are so close:
> predict(fit, wholeFrame[10683, ], type="raw")
A B
[1,] 0.5332296 0.4667704
I am trying to classify some bank transactions using the transaction amount. I had many other text based features in my original model, but noticed something fishy when using just the numeric one.
> head(trainingSet)
category amount
1 check 688.00
2 non-businesstransaction 2.50
3 non-businesstransaction 36.00
4 non-businesstransaction 243.22
5 payroll 302.22
6 non-businesstransaction 16.18
fit <- e1071::naiveBayes(category ~ amount, data=trainingSet)
fit
Naive Bayes Classifier for Discrete Predictors
Call: naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
bankfee check creditcardpayment e-commercedeposit insurance
0.029798103 0.189613233 0.054001459 0.018973486 0.008270494
intrabanktransfer loanpayment mcapayment non-businesstransaction nsf
0.045001216 0.015689613 0.011432741 0.563853077 0.023351982
other payroll taxpayment utilitypayment
0.003405497 0.014838239 0.005716371 0.016054488
Conditional probabilities:
amount
Y [,1] [,2]
bankfee 103.58490 533.67098
check 803.44668 2172.12515
creditcardpayment 819.27502 2683.43571
e-commercedeposit 42.15026 59.24806
insurance 302.16500 727.52321
intrabanktransfer 1795.54065 11080.73658
loanpayment 308.43233 387.71165
mcapayment 356.62755 508.02412
non-businesstransaction 162.41626 951.65934
nsf 44.92198 78.70680
other 9374.81071 18074.36629
payroll 1192.79639 2155.32633
taxpayment 1170.74340 1164.08019
utilitypayment 362.13409 1064.16875
According to the e1071 docs, the first column for "conditional probabilities" is the mean of the numeric variable, and the other is the standard deviation. These means and stdevs are correct, as are the apriori probabilities.
So, it's troubling that this row:
> thatRow
category amount
40 other 11268.53
receives these posteriors:
> predict(fit, newdata=thatRow, type="raw")
bankfee check creditcardpayment e-commercedeposit insurance intrabanktransfer loanpayment mcapayment
[1,] 4.634535e-96 7.28883e-06 9.401975e-05 0.4358822 4.778703e-51 0.02582751 1.103762e-174 1.358662e-101
non-businesstransaction nsf other payroll taxpayment utilitypayment
[1,] 1.446923e-29 0.5364704 0.001717378 1.133719e-06 2.059156e-18 2.149142e-24
Note that "nsf" has about 300X the score than "other" does. Since this transaction has an amount of 11.2k dollars, if it were to follow that "nsf" distribution, it would be over 100 standard deviations from the mean. Meanwhile, since "other" transactions have a sample mean of about 9k dollars with a large standard deviation, I would think that this transaction is much more probable as an "other". While "nsf" is more likely wrt the prior probabilities, they aren't so different as to outweigh that tail observation, and there are plenty of other viable candidates besides "other" as well.
I was assuming that this package just looked at the normal(mew=samplemean, stdev=samplestdev) pdf and used that value to multiply, but is that not the case? I can't quite figure out how to see the source.
Datatypes seem to be fine too:
> class(trainingSet$amount)
[1] "numeric"
> class(trainingSet$category)
[1] "factor"
The "naive bayes classifier for discrete predictors" in the printout is maybe odd, since this is a continuous predictor, but I assume this package can handle continuous predictors.
I had similar results with the klaR package. Maybe I need to set the kernel option on that?
Upvotes: 1
Views: 1444
Reputation: 1076
The threshold argument is a large part of this. The code in the package has a bit like this:
L <- sapply(1:nrow(newdata), function(i) {
ndata <- newdata[i, ]
L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
function(v) {
nd <- ndata[attribs[v]]
if (is.na(nd)) rep(1, length(object$apriori)) else {
prob <- if (isnumeric[attribs[v]]) {
msd <- object$tables[[v]]
msd[, 2][msd[, 2] <= eps] <- threshold
dnorm(nd, msd[, 1], msd[, 2])
} else object$tables[[v]][, nd]
prob[prob <= eps] <- threshold
prob
}
The threshold (and this is documented) will replace any probabilities less than eps. So, if the normal pdf for the continuous variable is 0.000000000, it will become .001 by default.
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.001)
> nrow(subset(wholeFrame, predicted != label))
[1] 249
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.0001)
> nrow(subset(wholeFrame, predicted != label))
[1] 17
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.00001)
> nrow(subset(wholeFrame, predicted != label))
[1] 3
Now, I believe that the quantities returned by the sapply are incorrect, since when "debugging" it, I got something like .012 for what should have been dnorm(49990, 100, 20), and I think something gets left out / mixed up with the mean and standard deviation matrix, but in any case, setting the threshold will help with this.
.001*(10/11) > pdfB*(1/11) or A having higher posterior than B due to this situation means that pdfB has to be less than .01 by chance.
> dnorm(49977, 50000, 20)
[1] 0.01029681
> 2*pnorm(49977, 50000, 20)
[1] 0.2501439
And since there were 1000 observations in class B, we should expect about 250 misclassifications, which is pretty close to the original 243.
Upvotes: 1