Reputation: 320
I'm using the iml package to derive ALE values from a caret-trained rf model. In classification tasks where the levels of the dependent variable are syntactically invalid strings, this can cause problems, because under the hood these levels end up as column names during prediction.
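To see which levels count as syntactically invalid, a small base-R check can compare each level with its make.names() result. This is only a sketch: the vector lvls and the data frame check are illustrative names, not part of iml or caret.

```r
# Levels used in the examples here (illustrative vector)
lvls <- c("0", "1", "A-1_x", "B-0_y", "1 C-$_3.5")

# A level is syntactically valid if make.names() leaves it unchanged
check <- data.frame(
  level = lvls,
  made  = make.names(lvls),
  valid = lvls == make.names(lvls),
  stringsAsFactors = FALSE
)
print(check)
```

None of these levels survives make.names() unchanged, so whether a given level causes trouble depends on how the names are handled downstream, not only on its validity.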
Here is a silly example which throws an "undefined columns selected" error on the last line of code:
# ----- Packages -----
library(randomForest)
library(caret)
library(iml)
# ----- Dummy Data -----
One <- as.factor(sample(c("1", "0"), size = 250, replace = TRUE))
Two <- as.factor(sample(make.names(c("1", "0")), size = 250, replace = TRUE))
Three <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))
Four <- as.factor(sample(make.names(c("A-1_x", "B-0_y", "1 C-$_3.5")), size = 250, replace = TRUE))
df <- cbind.data.frame(One, Two, Three, Four)
# ----- Modelling + IML for syntactically invalid levels from "Three" -----
ALE.ClassOfInterest <- "1 C-$_3.5"
TrainData <- cbind.data.frame(One, Two, Four)
rf <- caret::train(TrainData, Three, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE3 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results
I had some examples where a very simple modification did the trick: calling make.names in the second-to-last line of code, like so:
Pred <- Predictor$new(rf, data=df, class=make.names(ALE.ClassOfInterest))
However, in the above example this does not help, and the only solution I found is to apply make.names at the very beginning, turning all levels into syntactically valid strings before even training the model (see column "Four"). I'd like to stick to the original strings for various reasons, though, and I have noticed that other equally invalid levels like "0" and "1" (see column "One") don't require any workaround; this works:
# ----- Modelling + IML for syntactically invalid levels from "One" -----
ALE.ClassOfInterest <- "1"
TrainData <- cbind.data.frame(Two, Three, Four)
rf <- caret::train(TrainData, One, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results
Does anyone know what is happening under the hood, if it is not a plain make.names, or can anyone suggest a solution which lets me stick to the original factor levels in the model?
Thanks, Mark
Upvotes: 1
Reputation: 816
There was a bug that should be fixed as of version 0.11.3 (without the need for make.names). Maybe a few words in case others encounter similar issues:
The main issue/bug here was that the class you specify in Predictor selects from the columns produced by the $prediction.function() method (whose names do not have to be the same as the class levels). In your case, with iml version 0.11.1, it looked like this:
> Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
> head(Pred$prediction.function(df))
  1.C.._3.5 A.1_x B.0_y
1         0     0     1
2         1     0     0
3         0     1     0
4         0     0     1
5         0     1     0
6         0     0     1
Compare these column names with make.names(ALE.ClassOfInterest) and ALE.ClassOfInterest, which may illustrate why your code above failed:
> make.names(ALE.ClassOfInterest)
[1] "X1.C.._3.5"
> ALE.ClassOfInterest
[1] "1 C-$_3.5"
Using class = "1.C.._3.5" should have worked (obviously, this is unintuitive, and it should be fixed as of version 0.11.3). In case something similar happens again, just let me know in the iml issue tracker. I hope that, with my comment above, people can now better understand what's going on, and that using one of the column names produced by Pred$prediction.function(df) instead of the class level of interest should do the job (although I hope that this is no longer necessary, since I think I fixed the issue).
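The mismatch can be reproduced in base R without training a model. In this sketch, pred.cols is a hand-copied stand-in for colnames(Pred$prediction.function(df)) taken from the 0.11.1 output shown above, and lvls stands in for levels(Three). Matching by position assumes the prediction columns follow the order of the factor levels, which that output suggests but which is not guaranteed.

```r
# Column names as printed by Pred$prediction.function(df) under iml 0.11.1
# (copied by hand; with real objects use colnames(Pred$prediction.function(df)))
pred.cols <- c("1.C.._3.5", "A.1_x", "B.0_y")

# Original factor levels, in the order levels(Three) would return them
lvls <- c("1 C-$_3.5", "A-1_x", "B-0_y")

ALE.ClassOfInterest <- "1 C-$_3.5"

# Neither the raw level nor its make.names() version is a column name
raw.hit  <- ALE.ClassOfInterest %in% pred.cols
made.hit <- make.names(ALE.ClassOfInterest) %in% pred.cols

# Assumption: columns are in level order, so look the name up by position
class.col <- pred.cols[match(ALE.ClassOfInterest, lvls)]
print(c(raw = raw.hit, made = made.hit))
print(class.col)
```

With the real objects, class.col would then be passed as class = class.col to Predictor$new on affected iml versions.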
Upvotes: 1
Reputation: 320
For the sake of completeness, here is a complete example, including the workaround I didn't really want to use. It shows that:
- for syntactically invalid levels like "0", make.names within iml's Predictor$new is not required and would actually cause an error; it just works as if the level were syntactically valid
- for syntactically invalid levels like "ABC01-01_02::XYZ02-01_2", make.names within iml's Predictor$new is a valid workaround
- for syntactically invalid levels like "1 C-$_3.5", make.names within iml's Predictor$new is not a valid workaround, but doing nothing, as for "0", does not work either
- creating syntactically valid levels by applying make.names before training a model works for all three examples above and does not require any special treatment within iml's Predictor$new
# Packages
library(randomForest)
library(caret)
library(iml)
# Syntactically Invalid Levels
I1 <- as.factor(sample(c("1", "0"), size = 250, replace = TRUE))
I2 <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))
I3 <- as.factor(sample(c("ABC01-01_02", "XYZ02-01_2", "ABC01-01_02::XYZ02-01_2"), size = 250, replace = TRUE))
df.invalid <- cbind.data.frame(I1,I2,I3)
# Syntactically Valid Levels
V1 <- as.factor(sample(make.names(c("1", "0")), size = 250, replace = TRUE))
V2 <- as.factor(sample(make.names(c("A-1_x", "B-0_y", "1 C-$_3.5")), size = 250, replace = TRUE))
V3 <- as.factor(sample(make.names(c("ABC01-01_02", "XYZ02-01_2", "ABC01-01_02::XYZ02-01_2")), size = 250, replace = TRUE))
df.valid <- cbind.data.frame(V1,V2,V3)
# Using df.invalid trying to apply make.names within iml only
# Classification for "1"
ALE.ClassOfInterest <- "1"
TrainData <- cbind.data.frame(I2,I3)
rf <- caret::train(TrainData, I1, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
# For "1" no make.names is required
Pred <- Predictor$new(rf, data=df.invalid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# For "1" make.names causes an error
#Pred <- Predictor$new(rf, data=df.invalid, class=make.names(ALE.ClassOfInterest))
#FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# Classification for "1 C-$_3.5"
ALE.ClassOfInterest <- "1 C-$_3.5"
TrainData <- cbind.data.frame(I1,I3)
rf <- caret::train(TrainData, I2, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
# For "1 C-$_3.5" no make.names causes an error
#Pred <- Predictor$new(rf, data=df.invalid, class=ALE.ClassOfInterest)
#FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# For "1 C-$_3.5" make.names also causes an error
Pred <- Predictor$new(rf, data=df.invalid, class=make.names(ALE.ClassOfInterest))
FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# Classification for "ABC01-01_02::XYZ02-01_2"
ALE.ClassOfInterest <- "ABC01-01_02::XYZ02-01_2"
TrainData <- cbind.data.frame(I1,I2)
rf <- caret::train(TrainData, I3, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
# For "ABC01-01_02::XYZ02-01_2" no make.names causes an error
#Pred <- Predictor$new(rf, data=df.invalid, class=ALE.ClassOfInterest)
#FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# For "ABC01-01_02::XYZ02-01_2" make.names avoids the error
Pred <- Predictor$new(rf, data=df.invalid, class=make.names(ALE.ClassOfInterest))
FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# Using df.valid applying make.names before model training
# Classification for "1"
ALE.ClassOfInterest <- make.names("1")
TrainData <- cbind.data.frame(V2,V3)
rf <- caret::train(TrainData, V1, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df.valid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.valid), method="ale")$results
# Classification for make.names("1 C-$_3.5")
ALE.ClassOfInterest <- make.names("1 C-$_3.5")
TrainData <- cbind.data.frame(V1,V3)
rf <- caret::train(TrainData, V2, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df.valid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.valid), method="ale")$results
# Classification for make.names("ABC01-01_02::XYZ02-01_2")
ALE.ClassOfInterest <- make.names("ABC01-01_02::XYZ02-01_2")
TrainData <- cbind.data.frame(V1,V2)
rf <- caret::train(TrainData, V3, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df.valid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.valid), method="ale")$results
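If the make.names route is taken but the original labels are still wanted for reporting, one option is to keep a lookup table from sanitised to original levels. A minimal base-R sketch, assuming make.names produces no duplicates for the levels at hand; orig, lookup, y.valid and restored are illustrative names, not part of iml or caret:

```r
orig <- c("A-1_x", "B-0_y", "1 C-$_3.5")

# make.names() can map different strings to the same name, so check first
stopifnot(!anyDuplicated(make.names(orig)))

# Named vector: names are the sanitised levels, values the original ones
lookup <- setNames(orig, make.names(orig))

set.seed(1)
y <- factor(sample(orig, size = 20, replace = TRUE))

# Sanitised response that would be used for training and in Predictor$new
y.valid <- factor(make.names(as.character(y)))

# Translate sanitised labels back to the original strings
restored <- unname(lookup[as.character(y.valid)])
print(head(data.frame(valid = y.valid, original = restored)))
```

The same lookup could be applied to any column of class labels in the results, so the original strings only disappear while the model is trained and queried.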
Upvotes: 2
Reputation: 160447
This appears to be a feature/bug already reported to the package author in issue iml/195. I'm not optimistic about a quick fix, since that issue was opened in July 2022 (20 months ago as of writing this answer) and has had no commentary from the author. (The last change to the R functions was in April 2022; the package does not appear to get many updates.)
Upvotes: 2