BeginR

Reputation: 11

Why is predict not delivering the expected result?

data <- data.frame(day_type = c("weekend", "weekend", "weekend","weekend",
                                "weekday", "weekday", "weekday", "weekday"),
                   vehicle = c("car", "car", "car", "car",
                               "bus", "bus", "bus", "bus"))

library(naivebayes)

model <- naive_bayes(vehicle ~ day_type, data = data)

predict(model, data.frame(day_type = "weekend"))
    [1] bus
Levels: bus car

The expected answer here is car, but I am getting bus. Please help me identify the error.

Upvotes: 1

Views: 33

Answers (1)

AntoniosK

Reputation: 16121

This will help you understand the issue:

data <- data.frame(day_type = c("weekend", "weekend", "weekend","weekend",
                                "weekday", "weekday", "weekday", "weekday"),
                   vehicle = c("car", "car", "car", "car",
                               "bus", "bus", "bus", "bus"))

library(naivebayes)

model <- naive_bayes(vehicle ~ day_type, data = data)

dt_test1 = data.frame(day_type = "weekend")
dt_test2 = data.frame(day_type = "weekday")
dt_test3 = data.frame(day_type = c("weekend","weekday"))

predict(model, newdata = dt_test1)

# [1] bus
# Levels: bus car

predict(model, newdata = dt_test2)

# [1] bus
# Levels: bus car

predict(model, newdata = dt_test3)

# [1] car bus
# Levels: bus car

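Test datasets 1 and 2 each contain a single label, so day_type becomes a factor with one level, and that level is coded as 1 whether it is "weekend" or "weekday". The model only understands the codes 1 and 2 it learned from your original dataset data (weekday -> 1, weekend -> 2, because levels are sorted alphabetically) and doesn't care about the labels themselves. In both cases code 1 is read as "weekday", so the prediction is bus. In test dataset 3 both labels are present, so they get the same codes as in the training data and the predictions are correct.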
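To see the coding described above without fitting any model, here is a minimal base R sketch. The names train_day and test_day are made up for illustration, and factor() is called explicitly so the result does not depend on the stringsAsFactors default of data.frame (which changed to FALSE in R 4.0.0).

```r
# Training-like column: both labels present, levels sorted alphabetically.
train_day <- factor(c("weekend", "weekend", "weekday", "weekday"))
levels(train_day)       # "weekday" "weekend"
as.integer(train_day)   # 2 2 1 1

# A one-row test column built on its own has a single level.
test_day <- factor("weekend")
levels(test_day)        # "weekend"
as.integer(test_day)    # 1, i.e. the code that means "weekday" in training
```

The label "weekend" is the same in both columns, but its integer code is 2 in one and 1 in the other, which is the mismatch behind the unexpected bus prediction.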

As an extreme case, check this:

dt_test4 = data.frame(day_type = c("January","February"))

predict(model, newdata = dt_test4)

# [1] car bus
# Levels: bus car

You still get predictions, because those values, which the model has never seen, are again coded as 1 and 2.

Therefore, as @Aaron suggested, make sure the factor levels in the new data match those in the training data, or use character variables instead of factor variables.
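One way to make the levels match is to build the new column with the training levels spelled out. This is only a sketch: train_levels, new_day and dt_fixed are hypothetical names, and train_levels is typed by hand here (with a factor column in data you could take it from levels(data$day_type) instead).

```r
# Levels as they appear in the training factor (assumed: alphabetical order).
train_levels <- c("weekday", "weekend")

# Build the test column with the full set of training levels.
new_day  <- factor("weekend", levels = train_levels)
dt_fixed <- data.frame(day_type = new_day)

levels(dt_fixed$day_type)       # "weekday" "weekend"
as.integer(dt_fixed$day_type)   # 2, the same code "weekend" has in training

# A label that is not among the training levels becomes NA
# instead of silently borrowing code 1 or 2.
unseen <- factor("January", levels = train_levels)
is.na(unseen)                   # TRUE
```

Passing dt_fixed as newdata to predict should then give the model the same coding it was trained on, and unseen labels show up as NA rather than as a confident-looking prediction.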

Upvotes: 3
