Reputation: 607
I am running a logistic regression on many different subsets of a large data frame. To do so, I use the follwoing code (using dplyr
and purrr
):
# define model to be run
mod_fun <- function(df) {
glm(presence ~ transect, data = df, family = "binomial")
}
# nest data and run model
mod.glm <- dat %>%
nest(-c(region, fYear, species, road)) %>%
mutate(model = map(data, mod_fun))
# define functions to extract model coefficients
b_fun <- function(mod) {
coef(summary(mod))[2]
}
p_fun <- function(mod) {
coef(summary(mod))[8]
}
# extract coefficients
slope<-mod.glm %>% group_by(species, region, fYear, road) %>%
transmute(beta = map_dbl(model, b_fun),
p_val = map_dbl(model, p_fun))
As you can see, I only want to extract the estimate and p-value of the slope (called transect
). To do so, I use the indexing coef(summary(mod))[2]
etc. The problem is that there are also some subsets in my data frame resulting in over-determined systems where some coefficients will be set to NA. Using coef(summary(mod))[2]
extracts the 2nd value of the coef()
output and as NAs are ignored in coef()
, this will no longer be the estimate of transect
I want to extract. So far I tried unsing coef(summary(mod_2), complete = TRUE)
(--> nothing changes, NAs still not showed) and adressing the values directly coef(summary(mod_2), complete = TRUE)["transect","Estimate"]
(--> throws an error). Does anyone know how I could solve this issue?
What I tried so far:
# two example models; mod_2 will result in NAs
mod_1 <- glm(presence ~ transect, data = dat[dat$fYear == 1& dat$species=="Plantago lanceolata",], family = "binomial")
mod_2 <- glm(presence ~ transect, data = dat[dat$fYear == 2& dat$species=="Plantago lanceolata",], family = "binomial")
coef(summary(mod_1))[2] # works fine
coef(summary(mod_2))[2] # not the value I want
coef(summary(mod_1), complete = TRUE)["transect","Estimate"] # works fine
coef(summary(mod_2), complete = TRUE)["transect","Estimate"] # error
coef(summary(mod_2), complete = TRUE) # NAs for transect are still not displayed
summary(mod_2)$coefficients["transect","Estimate"] # is not working either
Data:
dput(dat)
structure(list(region = c("HWI", "HWI", "HWI", "HWI", "HWI",
"HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI",
"HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI",
"HWI", "HWI", "HWI", "HWI", "HWI", "HWI"), road = c("MK", "MK",
"MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK",
"MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK",
"MK", "MK", "MK", "MK", "MK"), transect = c(1L, 1L, 2L, 2L, 3L,
3L, 4L, 4L, 4L, 4L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L,
11L, 11L, 12L, 12L, 13L, 13L, 15L, 15L, 1L), fYear = c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), species = c("Plantago lanceolata",
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata",
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata",
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata",
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata",
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata",
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata",
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata"
), presence = c(1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -29L), groups = structure(list(
fYear = c(1L, 1L, 2L), road = c("MK", "MK", "MK"), species = c("Plantago lanceolata",
"Poa pratensis", "Plantago lanceolata"), .rows = list(c(1L,
3L, 5L, 7L, 9L, 11L, 13L, 15L, 17L, 19L, 21L, 23L, 25L, 27L
), c(2L, 4L, 6L, 8L, 10L, 12L, 14L, 16L, 18L, 20L, 22L, 24L,
26L, 28L), 29L)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
Thanks for your help!
Upvotes: 0
Views: 396
Reputation: 36086
I don't see how to extract the NA
rows from the coefficients table. The answer, instead, may be to add in the NA
rows after extracting the elements you want. This can be done with complete()
.
In this example I use broom::tidy()
because it makes it easy to get coefficients, a measure of uncertainty, and results of tests. This means I have to filter out the intercepts after the fact, but you could certainly do something similar with your functions if you add a column to complete()
on.
library(purrr)
library(tidyr)
library(dplyr)
mod.glm %>%
group_by(species, region, fYear, road) %>%
transmute(results = map(model, broom::tidy) ) %>%
unnest(results) %>%
complete(term = "transect") %>%
filter(term != "(Intercept)") %>%
ungroup()
# A tibble: 3 x 9
species region fYear road term estimate std.error statistic p.value
<chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Plantago lanceo~ HWI 1 MK transe~ -42.7 31127. -0.00137 0.999
2 Plantago lanceo~ HWI 2 MK transe~ NA NA NA NA
3 Poa pratensis HWI 1 MK transe~ -0.206 0.188 -1.10 0.272
Upvotes: 4
Reputation: 36086
Going a completely different route, you could change your extraction functions to return NA
when they give errors. This is a job for functions like tryCatch()
, but I've found possibly()
from purrr to be pretty handy for this sort of task.
possibly()
wraps around a function. The otherwise
arguments says what value to return if an error occurs when using the function.
Here are your two functions, wrapped in possibly()
. I've changed them to specifically be working with the "transect" row of the coefficients summary so this will error if that doesn't exist.
b_fun <- possibly(
function(mod) {
coef(summary(mod))["transect", 1]
}, otherwise = NA)
p_fun <- possibly(
function(mod) {
coef(summary(mod))["transect", 4]
}, otherwise = NA)
# extract coefficients
mod.glm %>%
group_by(species, region, fYear, road) %>%
transmute(beta = map_dbl(model, b_fun),
p_val = map_dbl(model, p_fun) ) %>%
ungroup()
# A tibble: 3 x 6
species region fYear road beta p_val
<chr> <chr> <int> <chr> <dbl> <dbl>
1 Plantago lanceolata HWI 1 MK -42.7 0.999
2 Poa pratensis HWI 1 MK -0.206 0.272
3 Plantago lanceolata HWI 2 MK NA NA
Upvotes: 4