Reputation: 3
I am trying to do Naive Bayes classification in R (package e1071). I tried the usual golf example, and I always get the opposite of the result I expect.
Scenario: if the weather is good, do I play golf, 'Yes' or 'No'? A very straightforward case.
I created a training dataset (df). Based on it, I expect 'Yes' for 'Good' weather, but the prediction is 'No':
[1] No
Levels: No Yes
Why is this happening? Is my understanding wrong, or am I doing something wrong?
Any help is much appreciated.
Cheers!
weather <- c("Good", "Good", "Good", "Bad", "Bad","Good")
golf <- c("Yes","No","Yes","No","Yes","Yes")
df <- data.frame(weather, golf) #Training dataset
df[] <- lapply(df, factor) #changing df to factor variables
df_new <- data.frame(weather = "Good") #Test dataset
library(e1071)
model <- naiveBayes(golf ~.,data=df)
predict(model, df_new, type ="class")
Upvotes: 0
Views: 345
Reputation: 3647
It's a problem with factor levels: your test data doesn't have the same levels as your training data. Some sample code should make it clear:
weather <- c("Good", "Good", "Good", "Bad", "Bad","Good")
golf <- c("Yes","No","Yes","No","Yes","Yes")
df <- data.frame(weather, golf) #Training dataset
df[] <- lapply(df, factor) #changing df to factor variables
Here are three ways to create the test data. The last two work because they build the factor with the same levels as the training data:
df_new <- data.frame(weather = "Good")
df_new1 <- data.frame(weather = df$weather[nrow(df)])
df_new2 <- data.frame(weather = factor("Good", levels = levels(df$weather)))
library(e1071)
model <- naiveBayes(golf ~.,data=df)
predict(model, df_new, type ="class")
#> [1] No
#> Levels: No Yes
predict() works as expected when the test factor has the same levels as the training data:
predict(model, df_new1)
#> [1] Yes
#> Levels: No Yes
predict(model, df_new2)
#> [1] Yes
#> Levels: No Yes
And we can see that the levels are off in the original df_new:
lapply(c(df_new, df_new1, df_new2), levels)
#> $weather
#> [1] "Good"
#>
#> $weather
#> [1] "Bad" "Good"
#>
#> $weather
#> [1] "Bad" "Good"
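As a sanity check that 'Yes' really is the answer to expect for 'Good' weather, here is a rough hand calculation in base R. It is only a sketch: it assumes plain frequency estimates with no Laplace smoothing, and the variable names (p_yes, score_yes, and so on) are made up for illustration and are not part of e1071.

```r
weather <- c("Good", "Good", "Good", "Bad", "Bad", "Good")
golf <- c("Yes", "No", "Yes", "No", "Yes", "Yes")

# Class priors estimated from the training data
p_yes <- mean(golf == "Yes")
p_no <- mean(golf == "No")

# Conditional probabilities of "Good" weather within each class
p_good_given_yes <- mean(weather[golf == "Yes"] == "Good")
p_good_given_no <- mean(weather[golf == "No"] == "Good")

# Unnormalised Bayes scores, then the normalised posterior for "Yes"
score_yes <- p_yes * p_good_given_yes
score_no <- p_no * p_good_given_no
posterior_yes <- score_yes / (score_yes + score_no)
posterior_yes  # 0.75
```

So with 'Good' weather the data favours 'Yes' three to one, which matches what predict() returns once the levels line up.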
Upvotes: 1
Reputation: 676
This is because factor encoding can be misleading. Indeed, if you do not make sure that factors in df and df_new are encoded the same way, you will get (seemingly) absurd results compared to what you see.
Take a look at the integer encoding of df:
print(df$weather)
Good Good Good Bad Bad Good
print(as.integer(df$weather))
2 2 2 1 1 2
And compare it to the encoding of df_new:
print(df_new$weather)
Good
print(as.integer(df_new$weather))
1
Good has been mapped to 1 in df_new, while 1 corresponds to Bad in df. So when you apply your model, you are asking for a prediction based on Bad weather.
You need to set the factor levels of df_new the same way they are encoded in df:
df_new <- data.frame(weather = "Good") #Test dataset
df_new$weather <- factor(df_new$weather, levels(df$weather))
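If you have more than one predictor, the same fix can be applied column by column. Below is a minimal sketch of that idea in base R. The helper name align_factor_levels is hypothetical (it is not part of e1071), and it assumes the new data uses the same column names as the training data.

```r
# Hypothetical helper: re-encode every column of new_data that is a factor
# in train_data so that both use the same levels.
align_factor_levels <- function(new_data, train_data) {
  shared <- intersect(names(new_data), names(train_data))
  for (col in shared) {
    if (is.factor(train_data[[col]])) {
      # as.character() handles both character and factor input
      new_data[[col]] <- factor(as.character(new_data[[col]]),
                                levels = levels(train_data[[col]]))
    }
  }
  new_data
}

weather <- c("Good", "Good", "Good", "Bad", "Bad", "Good")
golf <- c("Yes", "No", "Yes", "No", "Yes", "Yes")
df <- data.frame(weather, golf)  # Training dataset
df[] <- lapply(df, factor)

df_new <- align_factor_levels(data.frame(weather = "Good"), df)
levels(df_new$weather)      # "Bad" "Good"
as.integer(df_new$weather)  # 2, the same code that Good has in df
```

Note that any value in the new data that never appeared in training becomes NA with this approach, so it is worth checking for NAs before calling predict().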
Upvotes: 1