Reputation: 320

How does R's iml package handle syntactically invalid factor levels?

I'm using the iml package to derive ALE values from a caret-trained rf model. In classification tasks where the levels of the dependent variable are syntactically invalid names, this can cause problems because, under the hood, these levels end up as column names during prediction.

Here is a silly example that throws an "undefined columns selected" error at the last line of code:

# ----- Packages -----
library(randomForest)
library(caret)
library(iml)

# ----- Dummy Data -----
One <- as.factor(sample(c("1", "0"), size = 250, replace = TRUE))
Two <- as.factor(sample(make.names(c("1", "0")), size = 250, replace = TRUE))
Three <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))
Four <- as.factor(sample(make.names(c("A-1_x", "B-0_y", "1 C-$_3.5")), size = 250, replace = TRUE))
df <- cbind.data.frame(One, Two, Three, Four)

# ----- Modelling + IML for syntactically invalid levels from "Three" -----
ALE.ClassOfInterest <- "1 C-$_3.5"
TrainData <- cbind.data.frame(One, Two, Four)
rf <- caret::train(TrainData, Three, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE3 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results

In some cases a very simple modification did the trick: calling make.names in the second-to-last line of code, like so:

Pred <- Predictor$new(rf, data=df, class=make.names(ALE.ClassOfInterest))

However, in the above example this does not help. The only solution I found is to apply make.names at the very beginning, turning all levels into syntactically valid strings before the model is even trained (see column "Four"). I'd like to stick to the original strings for various reasons, though, and I have noticed that other equally invalid levels such as "0" and "1" (see column "One") don't require any workaround. This works:

# ----- Modelling + IML for syntactically invalid levels from "One" -----
ALE.ClassOfInterest <- "1"
TrainData <- cbind.data.frame(Two, Three, Four)
rf <- caret::train(TrainData, One, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results

Does anyone know what is happening under the hood, if it is not a plain make.names, or can anyone suggest a solution that lets me stick to the original factor levels in the model?

Thanks, Mark

Upvotes: 1

Answers (3)

Giuseppe

Reputation: 816

There was a bug that should be fixed as of version 0.11.3 (without the need for make.names). A few words in case others encounter similar issues:

The main issue was that the class you specify in Predictor is used to select a column from the output of the $prediction.function() method, and those column names do not have to be identical to the class levels. In your case (with iml version 0.11.1) it looked like this:

> Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
> head(Pred$prediction.function(df))
  1.C.._3.5 A.1_x B.0_y
1         0     0     1
2         1     0     0
3         0     1     0
4         0     0     1
5         0     1     0
6         0     0     1

Compare these column names with make.names(ALE.ClassOfInterest) and with ALE.ClassOfInterest itself, which illustrates why your code above failed:

> make.names(ALE.ClassOfInterest)
[1] "X1.C.._3.5"
> ALE.ClassOfInterest
[1] "1 C-$_3.5"

Using class = "1.C.._3.5" should have worked. Obviously, this is unintuitive, and it should be fixed as of version 0.11.3. If something similar happens again, just let me know in the iml issue tracker. I hope this explanation helps people understand what is going on: using one of the column names produced by Pred$prediction.function(df) instead of the class level of interest should do the job, although I hope this is no longer necessary, since I think I fixed the issue.
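As a rough illustration of that workaround, the sketch below looks up which prediction column corresponds to a given factor level. The helper find_class_column is hypothetical (it is not part of iml), and the third candidate, replacing invalid characters with "." without prepending "X", is only an assumption inferred from the column names shown above.

```r
# Hypothetical helper (not part of iml): given a factor level and the column
# names returned by the prediction function, return the matching column name.
find_class_column <- function(level, columns) {
  candidates <- unique(c(
    level,                               # level used unchanged
    make.names(level),                   # base R sanitising, may prepend "X"
    gsub("[^[:alnum:]._]", ".", level)   # invalid characters replaced, no prefix (assumption)
  ))
  hit <- candidates[candidates %in% columns]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Column names copied from the prediction output shown above
cols <- c("1.C.._3.5", "A.1_x", "B.0_y")
find_class_column("1 C-$_3.5", cols)
```

With a fitted Predictor, cols could instead be taken from colnames(Pred$prediction.function(df)), and the result passed as the class argument.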

Upvotes: 1

MarkH

Reputation: 320

For the sake of completeness, here is a complete example, including the workaround I didn't really want to use. It shows that:

for syntactically invalid levels like "0", make.names within iml's Predictor$new is not required and would actually cause an error; it just works as if the level were syntactically valid

for syntactically invalid levels like "ABC01-01_02::XYZ02-01_2", make.names within iml's Predictor$new is a valid workaround

for syntactically invalid levels like "1 C-$_3.5", make.names within iml's Predictor$new is not a valid workaround, but doing nothing (as for "0") does not work either

creating syntactically valid levels by applying make.names before training the model works for all three examples above and does not require any special treatment within iml's Predictor$new
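One possible reading of these three cases, sketched in base R only: if the prediction columns are named by replacing invalid characters with "." but without the "X" prefix that make.names adds to names starting with a digit, then make.names matches the column only when the two rules agree. The substitution rule here is an assumption inferred from the behaviour reported in this thread, not taken from iml's source.

```r
# The three levels discussed above
levels_of_interest <- c("0", "1 C-$_3.5", "ABC01-01_02::XYZ02-01_2")

# What make.names produces (prepends "X" when the name starts with a digit)
with_make_names <- make.names(levels_of_interest)

# Assumed column naming: invalid characters become ".", no prefix added
substituted_only <- gsub("[^[:alnum:]._]", ".", levels_of_interest)

comparison <- data.frame(
  level       = levels_of_interest,
  make_names  = with_make_names,
  substituted = substituted_only,
  same        = with_make_names == substituted_only
)
print(comparison)
```

Under this assumption, only the "ABC01-01_02::XYZ02-01_2" row agrees between the two rules, which would fit make.names being a valid workaround for that level alone; for "0" the substituted name equals the raw level, and for "1 C-$_3.5" it equals neither the raw level nor the make.names result.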

# Packages
library(randomForest)
library(caret)
library(iml)

# Syntactically Invalid Levels
I1 <- as.factor(sample(c("1", "0"), size = 250, replace = TRUE))
I2 <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))
I3 <- as.factor(sample(c("ABC01-01_02", "XYZ02-01_2", "ABC01-01_02::XYZ02-01_2"), size = 250, replace = TRUE))
df.invalid <- cbind.data.frame(I1,I2,I3)

# Syntactically Valid Levels
V1 <- as.factor(sample(make.names(c("1", "0")), size = 250, replace = TRUE))
V2 <- as.factor(sample(make.names(c("A-1_x", "B-0_y", "1 C-$_3.5")), size = 250, replace = TRUE))
V3 <- as.factor(sample(make.names(c("ABC01-01_02", "XYZ02-01_2", "ABC01-01_02::XYZ02-01_2")), size = 250, replace = TRUE))
df.valid <- cbind.data.frame(V1,V2,V3)


# Using df.invalid trying to apply make.names within iml only

# Classification for "1"
ALE.ClassOfInterest <- "1"
TrainData <- cbind.data.frame(I2,I3)
rf <- caret::train(TrainData, I1, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
# For "1" no make.names is required
Pred <- Predictor$new(rf, data=df.invalid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# For "1" make.names causes an error
#Pred <- Predictor$new(rf, data=df.invalid, class=make.names(ALE.ClassOfInterest))
#FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results

# Classification for "1 C-$_3.5"
ALE.ClassOfInterest <- "1 C-$_3.5"
TrainData <- cbind.data.frame(I1,I3)
rf <- caret::train(TrainData, I2, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
# For "1 C-$_3.5" omitting make.names causes an error
#Pred <- Predictor$new(rf, data=df.invalid, class=ALE.ClassOfInterest)
#FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# For "1 C-$_3.5" make.names also causes an error
Pred <- Predictor$new(rf, data=df.invalid, class=make.names(ALE.ClassOfInterest))
FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results

# Classification for "ABC01-01_02::XYZ02-01_2"
ALE.ClassOfInterest <- "ABC01-01_02::XYZ02-01_2"
TrainData <- cbind.data.frame(I1,I2)
rf <- caret::train(TrainData, I3, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
# For "ABC01-01_02::XYZ02-01_2" omitting make.names causes an error
#Pred <- Predictor$new(rf, data=df.invalid, class=ALE.ClassOfInterest)
#FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# For "ABC01-01_02::XYZ02-01_2" make.names avoids the error
Pred <- Predictor$new(rf, data=df.invalid, class=make.names(ALE.ClassOfInterest))
FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results


# Using df.valid applying make.names before model training

# Classification for "1"
ALE.ClassOfInterest <- make.names("1")
TrainData <- cbind.data.frame(V2,V3)
rf <- caret::train(TrainData, V1, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df.valid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.valid), method="ale")$results

# Classification for make.names("1 C-$_3.5")
ALE.ClassOfInterest <- make.names("1 C-$_3.5")
TrainData <- cbind.data.frame(V1,V3)
rf <- caret::train(TrainData, V2, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df.valid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.valid), method="ale")$results

# Classification for make.names("ABC01-01_02::XYZ02-01_2")
ALE.ClassOfInterest <- make.names("ABC01-01_02::XYZ02-01_2")
TrainData <- cbind.data.frame(V1,V2)
rf <- caret::train(TrainData, V3, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df.valid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.valid), method="ale")$results

Upvotes: 2

r2evans

Reputation: 160447

This appears to be a feature/bug already reported to the package author in issue iml/195. I'm not optimistic about a quick fix, since that issue was opened in July 2022 (20 months ago as of writing this answer) and has had no commentary from the author. (The last change to the R functions was in April 2022; the package does not appear to get many updates.)

Upvotes: 2
