Reputation: 1692
The dependent variable in my data contains NA values, but I still used it to build a linear model for each of several groups. How can I use each model's coefficients to derive predicted values for every row in its group?
Example data
library(dplyr)
df <- iris
df[c(1:2,51:52,101:102),3] <- NA # add NA values to dependent variable
# Create a linear model for each Species group
fitted_models <- df %>%
  group_by(Species) %>%
  do(model = lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = .))
The code below works as long as the dependent variable has no NA values. With the data that contains NA values (i.e., df), it fails with the following error: Error: Incompatible lengths: 50, 48.
library(tidyr)
library(purrr)
# works
fitted_models <- iris %>%
  nest(data = -Species) %>%
  mutate(fit = map(data, ~ lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = .x)),
         fitted.values = map(fit, "fitted.values")) %>%
  unnest(cols = c(data, fitted.values)) %>%
  select(-fit)
# does not work
fitted_models <- df %>%
  nest(data = -Species) %>%
  mutate(fit = map(data, ~ lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = .x)),
         fitted.values = map(fit, "fitted.values")) %>%
  unnest(cols = c(data, fitted.values)) %>%
  select(-fit)
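A quick check on a single group suggests where the 50 and 48 in the error come from. This is a minimal base R sketch, assuming R's default na.action (na.omit), under which lm() drops rows with NA in any model variable before fitting; the names setosa and fit are only illustrative.

```r
# Same NA pattern as in the example data (column 3 is Petal.Length)
df <- iris
df[c(1:2, 51:52, 101:102), 3] <- NA

# Look at one group only
setosa <- df[df$Species == "setosa", ]
fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = setosa)

nrow(setosa)                            # 50 rows in the group
length(fit$fitted.values)               # 48: the two NA rows were dropped before fitting
length(predict(fit, newdata = setosa))  # 50: one prediction per row of newdata
```

So unnest() is asked to line up a 50-row data frame with a 48-element vector, which it cannot do.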
The ideal output (see below) would use the coefficients from each of the Species models to derive predicted values for each row, including those with NA in the dependent variable.
Species Sepal.Length Sepal.Width Petal.Length Petal.Width predicted.values
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 NA 0.2 1.47
2 setosa 4.9 3 NA 0.2 1.46
3 setosa 4.7 3.2 1.3 0.2 1.42
4 setosa 4.6 3.1 1.5 0.2 1.41
5 setosa 5 3.6 1.4 0.2 1.46
6 setosa 5.4 3.9 1.7 0.4 1.51
7 setosa 4.6 3.4 1.4 0.3 1.40
8 setosa 5 3.4 1.5 0.2 1.46
9 setosa 4.4 2.9 1.4 0.2 1.38
10 setosa 4.9 3.1 1.5 0.1 1.45
# ... with 140 more rows
Upvotes: 1
Views: 242
Reputation: 5287
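This version, which uses predict() and purrr::map2(), works. Because predict() is given each group's full data as newdata, it returns a value for every row, including the rows where Petal.Length is NA.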
# Predict from a fitted model for every row of the data it is given
f <- function(.fit, .new_data) {
  predict(.fit, newdata = .new_data)
}

df %>%
  nest(data = -Species) %>%
  mutate(
    fit  = map(data, ~ lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = .x)),
    yhat = map2(.x = fit, .y = data, f)  # pair each model with its own group's data
  ) %>%
  unnest(cols = c(data, yhat)) %>%
  select(-fit)
Output:
# A tibble: 150 × 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width yhat
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 NA 0.2 1.48
2 setosa 4.9 3 NA 0.2 1.46
3 setosa 4.7 3.2 1.3 0.2 1.42
4 setosa 4.6 3.1 1.5 0.2 1.41
5 setosa 5 3.6 1.4 0.2 1.46
6 setosa 5.4 3.9 1.7 0.4 1.51
7 setosa 4.6 3.4 1.4 0.3 1.40
8 setosa 5 3.4 1.5 0.2 1.46
9 setosa 4.4 2.9 1.4 0.2 1.39
10 setosa 4.9 3.1 1.5 0.1 1.46
# … with 140 more rows
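For reference, what predict() does here can be reproduced by hand from the coefficients. The following is a rough base R sketch for one species, not something you need in practice; setosa, X, and manual are illustrative names, and the design matrix is built from the predictors only so that NA responses do not drop any rows.

```r
# Same NA pattern as in the question
df <- iris
df[c(1:2, 51:52, 101:102), 3] <- NA

setosa <- df[df$Species == "setosa", ]
fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = setosa)

# Design matrix with an intercept column plus the two predictors;
# the response is not part of this formula, so all 50 rows are kept
X <- model.matrix(~ Sepal.Length + Sepal.Width, data = setosa)

# Intercept + b1 * Sepal.Length + b2 * Sepal.Width for every row
manual <- drop(X %*% coef(fit))

# Compare with predict() on the same rows
all.equal(unname(manual), unname(predict(fit, newdata = setosa)))
```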
I wish I could get this to work with the fitted.values element that lm() has already calculated. I tried adding a row id before nesting so that the two data frames could be left-joined (data has all 50 rows per species, while fitted.values does not). But then I decided that two unnest operations and a join are more conceptual work than letting predict() rerun the linear equation.
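One possible way to keep fitted values aligned with the rows, sketched here in base R for a single species, is to fit with na.action = na.exclude and use the fitted() extractor instead of the raw fitted.values element. Note the limitation: the padded entries are NA, not predictions, so this alone does not produce the output the question asks for. The names setosa and padded are illustrative.

```r
# Same NA pattern as in the question
df <- iris
df[c(1:2, 51:52, 101:102), 3] <- NA

setosa <- df[df$Species == "setosa", ]

# na.exclude still fits on the complete rows only, but records which rows were dropped
fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width,
          data = setosa, na.action = na.exclude)

padded <- fitted(fit)       # the extractor pads the dropped rows with NA
length(padded)              # 50, aligned with the rows of setosa
which(is.na(padded))        # positions of the rows whose Petal.Length is NA
length(fit$fitted.values)   # 48: the raw element is not padded
```

In the nested pipeline this would mean map(fit, fitted) rather than map(fit, "fitted.values"); predict() with newdata remains the simpler route when values are wanted for the NA rows as well.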
Upvotes: 2