Hendy
Hendy

Reputation: 10604

Using lapply and !is.na to subset list vectors in R

I'm trying to apply the solution I found here to generate machine learning models:

Here's a dummy data set:

data_pred <- data.frame(x1 = 1:10, x2 = 11:20, x3 = 21:30)
data_resp <- data.frame(y1 = c(1:5, NA, 7:10), y2 = c(NA, 2, NA, 4:10))

Here was my for() loop method of modeling the predictors in data_pred on each individual column of measured responses in data_resp using the caret package:

# data_pred contains predictors
# data_resp contains one column per measurement
# 1 matching row per observation in both data_pred and data_resp

for (i in 1:ncol(data_resp)) {

   train(x = data_pred[!is.na(data_resp[, i]), ],
         y = data_resp[!is.na(data_resp[, i], i],
         ... )
}

Now I'm trying to do the same with lapply, which I think has numerous advantages. I'm having an issue with translating the !is.na() criteria on the fly so that I'm only modeling with non-NA cases for each response. Here was my initial function to test the lapply method:

rf_func <- function(y) {
  train(x = data_pred,
        y = y,
        method = "rf",
        tuneGrid = data.frame(.mtry = 3:6),
        nodesize = 3,
        ntrees = 500,
        trControl = trControl) }

Then create an empty list to store results and apply the function to data_resp:

models <- list(NULL)
models$rf <- lapply(as.list(data_resp), rf_func)

That works fine since randomForest can handle NAs, but other methods cannot, so I need to remove those rows from each data_resp element as well as the corresponding rows from my predictors.

I tried this without success:

train(x = data_pred_scale[!is.na(y), ],
      y = y[!is.na(y)],
      ... }

I also tried y[[!is.na(y)]]

How do I translate the data.frame method (df[!is.na(df2), ]) to lapply?

Upvotes: 2

Views: 2693

Answers (2)

Hendy
Hendy

Reputation: 10604

In fiddling around quite a bit with a single element of my as.list(data_frame) to simulate what lapply would be passing, I came up with this, which I think is working:

rf_func <- function(y) {
  train(x = data_pred_scale[!(unlist(lapply(y, is.na))), ], 
        y = y[!(unlist(lapply(y, is.na)))], 
        method = "rf",
        tuneGrid = data.frame(.mtry = 3:6),
        nodesize = 3,
        ntrees = 500,
        trControl = trControl) }

models$rf <- lapply(as.list(data_resp), rf_func)

It does seem to be working. I [hackishly] compared the non-NA data set to the trainingData results in each caret model like so:

nas <- NULL
for(i in 1:ncol(data_resp)) {nas <- c(nas, length(data_resp[!is.na(data_resp[, i]), i]))}

model_nas <- NULL
for(i in 1:length(nas)) {model_nas <- c(model_nas, nrow(models$rf[[i]]$trainingData))}

identical(nas, model_nas)
[1] TRUE

So, is y[!unlist(lapply(y, is.na)))] the best/most elegant way to do this sort of thing It's pretty ugly...


Edit: Based on @Ricardo Saporta 's answer, I was able to come up with this (probably obvious to the veterans, but bear with me):

rf_func <- function(x, y) {
  train(x = x,
        y = y,
        method = "rf",
        tuneGrid = data.frame(.mtry = 3:6),
        nodesize = 3,
        ntrees = 500,
        trControl = trControl) }

models$rf <- lapply(data_resp, function (y) {
  rf_func(data_pred_scale[!is.na(y), ], y[!is.na(y)] ) 
  }
)

Is there still a better way, or is that fairly decent? (Certainly prettier than my first mess up above.)

Upvotes: 0

Ricardo Saporta
Ricardo Saporta

Reputation: 55350

several different ways to go about it. A simple approach is with an anonymous function:

 lapply(data_resp, function(x) rf_func(x[!is.na(x)]))

Upvotes: 3

Related Questions