How can a blocking factor be included in makeClassifTask() from mlr package?

In some classification tasks, using mlr package, I need to deal with a data.frame similar to this one:

set.seed(pi)
# Dummy data frame
df <- data.frame(
   # Repeated values ID
   ID = sort(sample(c(0:20), 100, replace = TRUE)),
   # Some variables
   X1 = runif(10, 1, 10),
   # Some Label
   Label = sample(c(0,1), 100, replace = TRUE)
   )
df 

I need to cross-validate the model keeping together the values with the same ID, I know from the tutorial that:

https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html#further-settings

We could include a blocking factor in the task. This would indicate that some observations "belong together" and should not be separated when splitting the data into training and test sets for resampling.

The question is how can I include this blocking factor in the makeClassifTask?

Unfortunately, I couldn't find any example.

Upvotes: 3

Views: 803

Answers (2)

Ty Seddon
Ty Seddon

Reputation: 111

The answer by @jakob-r no longer works. My guess is something changed with cv10.

Minor edit to use "blocking.cv = TRUE"

Complete working example:

set.seed(pi)
# Dummy data frame
df <- data.frame(
   # Repeated values ID
   ID = sort(sample(c(0:20), 100, replace = TRUE)),
   # Some variables
   X1 = runif(10, 1, 10),
   # Some Label
   Label = sample(c(0,1), 100, replace = TRUE)
   )
df 

df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
resDesc <- makeResampleDesc("CV",iters=10,blocking.cv = TRUE)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = resDesc)

# to prove-check that blocking worked
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})

Upvotes: 1

jakob-r
jakob-r

Reputation: 7282

What version of mlr do you have? Blocking should be part of it since a while. You can find it directly as an argument in makeClassifTask

Here is an example for your data:

df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = cv10)

# to prove-check that blocking worked
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})
#all entries are empty, blocking indeed works! 

Upvotes: 4

Related Questions