dcc310

Reputation: 1076

Strange Results with Numeric Predictor in Naive Bayes in R

Update:

The following code should reproduce the issue:

# 10,000 "A" amounts near 100 and 1,000 "B" amounts near 50,000; the classes
# barely overlap, so a Bayes classifier should get nearly everything right
someFrameA = data.frame(label="A", amount=rnorm(10000, 100, 20))
someFrameB = data.frame(label="B", amount=rnorm(1000, 50000, 20))
wholeFrame = rbind(someFrameA, someFrameB)
fit <- e1071::naiveBayes(label ~ amount, wholeFrame)
wholeFrame$predicted = predict(fit, wholeFrame)
nrow(subset(wholeFrame, predicted != label))

In my case, this gave 243 misclassifications. (No seed is set, so the exact count will vary from run to run.)

Note these three rows (row number, label, amount, prediction):

10252     B 50024.81895         A
2955      A   100.55977         A
10678     B 50010.26213         B

The two B amounts differ by only about 14.6, yet one is classified correctly and the other is not. It's also curious that the posterior probabilities for rows like this are so close:

> predict(fit, wholeFrame[10683, ], type="raw")
             A         B
[1,] 0.5332296 0.4667704
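
To spell out the confusion: hand-rolling the computation I assumed the package performs (class-conditional normal densities from the fitted means and sds, times the priors, normalized; fit$tables$amount and fit$apriori appear to be where e1071 keeps those numbers) gives an unambiguous answer for a B amount like these:

msd   <- fit$tables$amount                    # per-class (mean, sd) matrix
prior <- fit$apriori / sum(fit$apriori)       # normalize, in case these are counts
lik   <- dnorm(50010.26, msd[, 1], msd[, 2])  # density under each class
lik * prior / sum(lik * prior)                # essentially A = 0, B = 1

The A density underflows to 0 here, so by hand the posterior for B should be essentially 1, not 0.47.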

Original Question:

I am trying to classify some bank transactions using the transaction amount. I had many other text-based features in my original model, but noticed something fishy when using just the numeric one.

> head(trainingSet)
                 category amount
1                   check 688.00
2 non-businesstransaction   2.50
3 non-businesstransaction  36.00
4 non-businesstransaction 243.22
5                 payroll 302.22
6 non-businesstransaction  16.18

fit <- e1071::naiveBayes(category ~ amount, data=trainingSet)
fit

Naive Bayes Classifier for Discrete Predictors

Call: naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
                bankfee                   check       creditcardpayment       e-commercedeposit               insurance 
            0.029798103             0.189613233             0.054001459             0.018973486             0.008270494 
      intrabanktransfer             loanpayment              mcapayment non-businesstransaction                     nsf 
            0.045001216             0.015689613             0.011432741             0.563853077             0.023351982 
                  other                 payroll              taxpayment          utilitypayment 
            0.003405497             0.014838239             0.005716371             0.016054488 

Conditional probabilities:
                         amount
Y                               [,1]        [,2]
  bankfee                  103.58490   533.67098
  check                    803.44668  2172.12515
  creditcardpayment        819.27502  2683.43571
  e-commercedeposit         42.15026    59.24806
  insurance                302.16500   727.52321
  intrabanktransfer       1795.54065 11080.73658
  loanpayment              308.43233   387.71165
  mcapayment               356.62755   508.02412
  non-businesstransaction  162.41626   951.65934
  nsf                       44.92198    78.70680
  other                   9374.81071 18074.36629
  payroll                 1192.79639  2155.32633
  taxpayment              1170.74340  1164.08019
  utilitypayment           362.13409  1064.16875

According to the e1071 docs, for a numeric variable the first column under "Conditional probabilities" is the mean and the second is the standard deviation. These means and standard deviations check out, as do the a-priori probabilities.

So, it's troubling that this row:

> thatRow
   category   amount
40    other 11268.53

receives these posteriors:

> predict(fit, newdata=thatRow, type="raw")
          bankfee       check creditcardpayment e-commercedeposit    insurance intrabanktransfer   loanpayment    mcapayment
[1,] 4.634535e-96 7.28883e-06      9.401975e-05         0.4358822 4.778703e-51        0.02582751 1.103762e-174 1.358662e-101
     non-businesstransaction       nsf       other      payroll   taxpayment utilitypayment
[1,]            1.446923e-29 0.5364704 0.001717378 1.133719e-06 2.059156e-18   2.149142e-24

Note that "nsf" scores about 300x higher than "other". Since this transaction is about $11.2k, under the "nsf" distribution it would sit more than 100 standard deviations from the mean. Meanwhile, "other" transactions have a sample mean of about $9.4k with a large standard deviation, so this transaction looks far more probable as an "other". "nsf" does have the higher prior, but the priors aren't so different as to outweigh that tail observation, and there are plenty of other viable candidates besides "other" as well.
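
A quick check with the fitted numbers backs this up: the "nsf" density underflows outright, while "other" gives a sane value.

dnorm(11268.53, 44.92198, 78.70680)       # "nsf": over 140 sds out, underflows to 0
dnorm(11268.53, 9374.81071, 18074.36629)  # "other": about 2.2e-05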

I was assuming that this package just evaluates the Normal(mu = sample mean, sd = sample sd) pdf and multiplies by that value, but is that not the case? I can't quite figure out how to see the source.

The data types seem fine too:

> class(trainingSet$amount)
[1] "numeric"
> class(trainingSet$category)
[1] "factor"

The "naive bayes classifier for discrete predictors" in the printout is maybe odd, since this is a continuous predictor, but I assume this package can handle continuous predictors.

I had similar results with the klaR package. Maybe I need to set the kernel option on that?
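
(For what it's worth, klaR's NaiveBayes has a usekernel argument that swaps the per-class normal density for a kernel density estimate; untested here.)

kfit <- klaR::NaiveBayes(category ~ amount, data = trainingSet, usekernel = TRUE)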

Upvotes: 1

Views: 1444

Answers (1)

dcc310

Reputation: 1076

The threshold argument is a large part of this. The package's predict method contains a bit like this:

L <- sapply(1:nrow(newdata), function(i) {
    ndata <- newdata[i, ]
    L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
        function(v) {
            nd <- ndata[attribs[v]]
            if (is.na(nd)) rep(1, length(object$apriori)) else {
                prob <- if (isnumeric[attribs[v]]) {
                    msd <- object$tables[[v]]
                    # degenerate standard deviations are replaced as well
                    msd[, 2][msd[, 2] <= eps] <- threshold
                    dnorm(nd, msd[, 1], msd[, 2])
                } else object$tables[[v]][, nd]
                # any probability at or below eps is bumped up to threshold
                prob[prob <= eps] <- threshold
                prob
            }
        })), 1, sum)
    # ...

The threshold (and this is documented) replaces any probability at or below eps. So if the normal pdf for the continuous variable underflows to 0, it becomes 0.001 by default.
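
Here is what that does to one of the borderline B rows from the question, sketched with the true simulation parameters (100/20 and 50000/20; the fitted values are essentially the same):

pA <- max(dnorm(50024.8, 100, 20), 0.001)  # underflows to 0, bumped up to the threshold
pB <- dnorm(50024.8, 50000, 20)            # about 0.0092, a genuine density
c(A = pA * 10/11, B = pB * 1/11)           # A's thresholded score narrowly wins

Shrinking the threshold shrinks that artificial advantage: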

> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.001)
> nrow(subset(wholeFrame, predicted != label))
[1] 249
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.0001)
> nrow(subset(wholeFrame, predicted != label))
[1] 17
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.00001)
> nrow(subset(wholeFrame, predicted != label))
[1] 3

Now, I believe the quantities returned by the sapply are incorrect, since when "debugging" it, I got something like 0.012 for what should have been dnorm(49990, 100, 20); I think something gets left out or mixed up with the mean and standard deviation matrix. In any case, setting the threshold will help with this.

Why roughly 250? A class-B row gets labeled A whenever .001*(10/11) > pdfB*(1/11), i.e. whenever pdfB < 0.01. With sd = 20, that happens once the amount is more than about 23 away from 50000:

> dnorm(49977, 50000, 20)
[1] 0.01029681
> 2*pnorm(49977, 50000, 20)
[1] 0.2501439

And since there were 1000 observations in class B, we should expect about 0.25 * 1000 = 250 misclassifications, which is pretty close to the original 243.
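
A quick simulation agrees (a sketch; the 0.01 cutoff comes from the inequality above):

mean(dnorm(rnorm(1e6, 50000, 20), 50000, 20) < 0.01)  # about 0.25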

Upvotes: 1
