Rajith

Reputation: 3

Naive Bayes classification in R gives the opposite result

I am trying to do a Naive Bayes classification in R (package e1071). I tried the usual golf example and I always get the opposite result.

Scenario: if the weather is good, do I play golf, 'Yes' or 'No'? A very straightforward case.

I created a training dataset (df). Based on it, I expect 'Yes' for 'Good' weather, but the prediction is 'No':

[1] No
Levels: No Yes

Why is this happening? Is my understanding wrong, or am I doing something wrong?

Any help is much appreciated.

Cheers!

weather <- c("Good", "Good", "Good", "Bad", "Bad","Good")
golf <- c("Yes","No","Yes","No","Yes","Yes")
df <- data.frame(weather, golf) #Training dataset

df[] <- lapply(df, factor) #changing df to factor variables

df_new <- data.frame(weather = "Good") #Test dataset

library(e1071)
model <- naiveBayes(golf ~.,data=df)
predict(model, df_new, type ="class")

Upvotes: 0

Views: 345

Answers (2)

gfgm

Reputation: 3647

It's a problem with factor levels: your test data doesn't have the same levels as the training data. Some sample code should make it clear:

weather <- c("Good", "Good", "Good", "Bad", "Bad","Good")
golf <- c("Yes","No","Yes","No","Yes","Yes")
df <- data.frame(weather, golf) #Training dataset

df[] <- lapply(df, factor) #changing df to factor variables

Here are three ways to create the test data; the last two work because they specify a factor with the same levels as the training data:

df_new <- data.frame(weather = "Good")
df_new1 <- data.frame(weather = df$weather[nrow(df)]) 
df_new2 <- data.frame(weather = factor("Good", levels = levels(df$weather)))

library(e1071)
model <- naiveBayes(golf ~.,data=df)
predict(model, df_new, type ="class")
#> [1] No
#> Levels: No Yes

predict works as expected on the factor variables whose levels match the training data:

predict(model, df_new1)
#> [1] Yes
#> Levels: No Yes

predict(model, df_new2)
#> [1] Yes
#> Levels: No Yes

And we can see that the levels are off in the original:

lapply(c(df_new, df_new1, df_new2), levels)
#> $weather
#> [1] "Good"
#> 
#> $weather
#> [1] "Bad"  "Good"
#> 
#> $weather
#> [1] "Bad"  "Good"

Upvotes: 1

AshOfFire

Reputation: 676

This is because factor encoding can be misleading. If you do not make sure that the factors in df and df_new are encoded the same way, you will get (seemingly) absurd results.

Take a look at the integer encoding of df:

print(df$weather)
#> [1] Good Good Good Bad  Bad  Good
#> Levels: Bad Good
print(as.integer(df$weather))
#> [1] 2 2 2 1 1 2

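And compare it to the encoding of df_new (this assumes data.frame() converts strings to factors, which was the default before R 4.0.0):

print(df_new$weather)
#> [1] Good
#> Levels: Good
print(as.integer(df_new$weather))
#> [1] 1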

Good has been mapped to 1 in df_new, while 1 corresponds to Bad in df. So when you apply your model, you are asking for a prediction based on Bad weather.
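
As a rough sanity check, the posteriors can be worked out by hand from the six training rows. This is only a sketch in base R, not e1071's actual implementation, and the `posterior` helper is a name made up for illustration. For Good weather the posterior favours Yes, while for Bad weather the two classes come out tied, so a 'No' for a row that is effectively read as Bad is at least plausible (how e1071 breaks such a tie is not something this sketch shows).

```r
# Hand computation of the Naive Bayes posterior, base R only (no e1071).
weather <- factor(c("Good", "Good", "Good", "Bad", "Bad", "Good"))
golf    <- factor(c("Yes", "No", "Yes", "No", "Yes", "Yes"))

prior <- prop.table(table(golf))                       # P(golf)
cond  <- prop.table(table(golf, weather), margin = 1)  # P(weather | golf), rows sum to 1

# Hypothetical helper: posterior P(golf | weather = w) via Bayes' rule
posterior <- function(w) {
  unnormalised <- prior * cond[, w]
  unnormalised / sum(unnormalised)
}

post_good <- posterior("Good")  # No 0.25, Yes 0.75
post_bad  <- posterior("Bad")   # No 0.5,  Yes 0.5 (a tie)
print(post_good)
print(post_bad)
```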

You need to give the factor in df_new the same levels as the one in df:

df_new <- data.frame(weather = "Good") #Test dataset
df_new$weather <- factor(df_new$weather, levels(df$weather))
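
If there are several predictor columns, the same fix can be applied to all of them in one go. The sketch below uses a hypothetical helper, `align_levels`, which is not part of e1071 or base R; it assumes the training columns are already factors and that the new data uses the same column names.

```r
# Hypothetical helper (not from any package): re-encode each column of `new`
# that is a factor in `train`, using the training levels.
# Values that do not occur in the training levels become NA.
align_levels <- function(new, train) {
  for (col in intersect(names(new), names(train))) {
    if (is.factor(train[[col]])) {
      new[[col]] <- factor(as.character(new[[col]]),
                           levels = levels(train[[col]]))
    }
  }
  new
}

df <- data.frame(
  weather = factor(c("Good", "Good", "Good", "Bad", "Bad", "Good")),
  golf    = factor(c("Yes", "No", "Yes", "No", "Yes", "Yes"))
)
df_new <- align_levels(data.frame(weather = "Good"), df)

levels(df_new$weather)      # "Bad" "Good"
as.integer(df_new$weather)  # 2, the same code Good has in df
```

Going through as.character first means the helper behaves the same whether the incoming column is a character vector or a factor with different levels.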

Upvotes: 1
