Reputation: 10604
I'm trying to apply the solution I found here to generate machine learning models:
Here's a dummy data set:
data_pred <- data.frame(x1 = 1:10, x2 = 11:20, x3 = 21:30)
data_resp <- data.frame(y1 = c(1:5, NA, 7:10), y2 = c(NA, 2, NA, 4:10))
Here was my for()
loop method of modeling the predictors in data_pred
on each individual column of measured responses in data_resp
using the caret
package:
# data_pred contains predictors
# data_resp contains one column per measurement
# 1 matching row per observation in both data_pred and data_resp
for (i in 1:ncol(data_resp)) {
train(x = data_pred[!is.na(data_resp[, i]), ],
y = data_resp[!is.na(data_resp[, i], i],
... )
}
Now I'm trying to do the same with lapply
, which I think has numerous advantages. I'm having an issue with translating the !is.na()
criteria on the fly so that I'm only modeling with non-NA cases for each response. Here was my initial function to test the lapply
method:
rf_func <- function(y) {
train(x = data_pred,
y = y,
method = "rf",
tuneGrid = data.frame(.mtry = 3:6),
nodesize = 3,
ntrees = 500,
trControl = trControl) }
Then create an empty list to store results and apply the function to data_resp
:
models <- list(NULL)
models$rf <- lapply(as.list(data_resp), rf_func)
That works fine since randomForest
can handle NA
s, but other methods cannot, so I need to remove those rows from each data_resp
element as well as the corresponding rows from my predictors.
I tried this without success:
train(x = data_pred_scale[!is.na(y), ],
y = y[!is.na(y)],
... }
I also tried y[[!is.na(y)]]
How do I translate the data.frame method (df[!is.na(df2), ]
) to lapply
?
Upvotes: 2
Views: 2693
Reputation: 10604
In fiddling around quite a bit with a single element of my as.list(data_frame)
to simulate what lapply
would be passing, I came up with this, which I think is working:
rf_func <- function(y) {
train(x = data_pred_scale[!(unlist(lapply(y, is.na))), ],
y = y[!(unlist(lapply(y, is.na)))],
method = "rf",
tuneGrid = data.frame(.mtry = 3:6),
nodesize = 3,
ntrees = 500,
trControl = trControl) }
models$rf <- lapply(as.list(data_resp), rf_func)
It does seem to be working. I [hackishly] compared the non-NA
data set to the trainingData
results in each caret
model like so:
nas <- NULL
for(i in 1:ncol(data_resp)) {nas <- c(nas, length(data_resp[!is.na(data_resp[, i]), i]))}
model_nas <- NULL
for(i in 1:length(nas)) {model_nas <- c(model_nas, nrow(models$rf[[i]]$trainingData))}
identical(nas, model_nas)
[1] TRUE
So, is y[!unlist(lapply(y, is.na)))]
the best/most elegant way to do this sort of thing It's pretty ugly...
Edit: Based on @Ricardo Saporta 's answer, I was able to come up with this (probably obvious to the veterans, but bear with me):
rf_func <- function(x, y) {
train(x = x,
y = y,
method = "rf",
tuneGrid = data.frame(.mtry = 3:6),
nodesize = 3,
ntrees = 500,
trControl = trControl) }
models$rf <- lapply(data_resp, function (y) {
rf_func(data_pred_scale[!is.na(y), ], y[!is.na(y)] )
}
)
Is there still a better way, or is that fairly decent? (Certainly prettier than my first mess up above.)
Upvotes: 0
Reputation: 55350
several different ways to go about it. A simple approach is with an anonymous function:
lapply(data_resp, function(x) rf_func(x[!is.na(x)]))
Upvotes: 3