ds_col

Reputation: 139

How can I use a custom resampling that respects temporal order for non-identical tasks of different sizes?

I have tasks whose rows have a temporal order (e.g. monthly data). I want to perform a leave-one-out ("loo") style resampling, but the training data must always be earlier than the test data. So I generate a custom resampling as follows:

# Instantiate the resampling
resampling_backtest = rsmp("custom")

# First split: train on rows 1-30, test on row 31.
# The list names are arbitrary; the lists are passed to instantiate() below.
train_sets = list(1:30)
test_sets = list(31)

# Each later split tests on one more month and trains on all earlier rows.
for (testmonth in 32:task$nrow) {
  train_sets = append(train_sets, list(1:(testmonth - 1)))
  test_sets = append(test_sets, list(testmonth))
}

resampling_backtest$instantiate(task, train_sets, test_sets)
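The loop above grows both lists one element at a time. The same expanding-window splits can be built in one pass with lapply(), which also makes it easy to reuse the logic for tasks of different sizes. This is only a sketch in base R: the names expanding_window_splits and first_test are hypothetical, and it assumes the row ids are passed in temporal order.

```r
# Hypothetical helper: expanding-window splits for one set of row ids.
# `row_ids` must already be in temporal order; `first_test` is the position
# of the first test observation (31 in the question).
expanding_window_splits = function(row_ids, first_test = 31) {
  stopifnot(first_test >= 2, first_test <= length(row_ids))
  test_pos = first_test:length(row_ids)
  list(
    # Training set i: every row strictly before the test row.
    train_sets = lapply(test_pos, function(i) row_ids[seq_len(i - 1)]),
    # Test set i: the single row at position i.
    test_sets = lapply(test_pos, function(i) row_ids[i])
  )
}

# Toy example with 36 "months".
splits = expanding_window_splits(1:36)
```

The two list elements can then be passed to resampling_backtest$instantiate(task, splits$train_sets, splits$test_sets).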

My tasks are different subsets of one large sample that has a "Date" column. Each subsample is ordered: I first create task_n <- TaskClassif$new(...) and then call task_n$set_col_roles("Date", roles = "order") for each of my n tasks.

Now I have two problems:

  1. The resampling scheme is defined by row ids, but a row id such as 2 refers to a different month in each task. This might not be a real problem if it were not for the second point.
  2. When I make a list of the n tasks (list_of_tasks=list(task_1,...task_n)) and define a benchmark as below, I get an error message:
design = benchmark_grid(
  tasks = list_of_tasks,
  learners = list_of_learners,
  resamplings = resampling_backtest      
)

The error message is Error: All tasks must be uninstantiated, or must have the same number of rows.

So, what can I do here? Is there a way to hand over the resampling "uninstantiated"? Or do I need to manually define a resampling scheme for each of the n tasks separately? If yes, how can I hand that over to benchmark_grid()?

Upvotes: 0

Views: 103

Answers (1)

be-marc

Reputation: 1491

Or do I need to manually define a resampling scheme for each of the n tasks separately?

Yes. Just create the benchmark design manually with data.table(). An example with instantiated resamplings:

library(mlr3)
library(data.table)

task_pima = tsk("pima")
task_spam = tsk("spam")

resampling_pima = rsmp("cv", folds = 3)
resampling_pima$instantiate(task_pima)

resampling_spam = rsmp("cv", folds = 3)
resampling_spam$instantiate(task_spam)

design = data.table(
  task = list(task_pima, task_spam),
  learner = list(lrn("classif.rpart"), lrn("classif.rpart")),
  resampling = list(resampling_pima, resampling_spam)
)

bmr = benchmark(design)
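Applied to the temporal setup from the question, the same idea might look like the sketch below. It is not a definitive implementation: make_backtest and first_test are hypothetical names, the two subsetted tasks are only stand-ins for task_1, ..., task_n, and the code assumes each task's rows are already sorted by date so that the order of task$row_ids is the temporal order.

```r
library(mlr3)
library(data.table)

# Hypothetical helper: one instantiated expanding-window resampling per task.
# Assumption: the order of task$row_ids is the temporal order.
make_backtest = function(task, first_test = 31) {
  ids = task$row_ids                     # use the task's own row ids
  test_pos = first_test:length(ids)
  train_sets = lapply(test_pos, function(i) ids[seq_len(i - 1)])
  test_sets = lapply(test_pos, function(i) ids[i])
  resampling = rsmp("custom")
  resampling$instantiate(task, train_sets, test_sets)
  resampling
}

# Stand-in tasks of different sizes; replace with list(task_1, ..., task_n).
list_of_tasks = list(
  tsk("pima")$filter(1:60),
  tsk("spam")$filter(seq(1, 4601, by = 100))
)
list_of_learners = list(lrn("classif.rpart"), lrn("classif.featureless"))

# Each task gets its own resampling, sized to that task.
resamplings = lapply(list_of_tasks, make_backtest)

# One design row per (task, learner) pair; the resampling follows the task.
grid = CJ(
  task_idx = seq_along(list_of_tasks),
  learner_idx = seq_along(list_of_learners)
)
design = data.table(
  task = list_of_tasks[grid$task_idx],
  learner = list_of_learners[grid$learner_idx],
  resampling = resamplings[grid$task_idx]
)

bmr = benchmark(design)
```

Building the splits from task$row_ids rather than from 1:n keeps them valid when a task's row ids are not contiguous, as in the subsetted spam task here. Results can then be summarised with, for example, bmr$aggregate(msr("classif.ce")).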

Upvotes: 1
