Reputation: 91
I ran about 30,000 multiple regression models using pylr
packages.
fit = plyr::dlply(data, "id", function(x) lm(t ~ d1 + d2 + d3, data = x))
We have more than 30,000 IDs so it came out with more than 30,000 models. I cannot read the models one by one. Are there any efficient ways to look at the models based on the conditions. I want to look at models according to
top 5 p-value (from the smallest)
top r squred
all the models with d3 significant
At this moment, I can only look at the model one by one using l_ply(fit, summary, .print = TRUE)
I tried to use tidy
function in broom
packages in order to change the results into cleaned format so that i can export .csv file but it wouldn't work out.
test = broom::tidy(fit)
I received error messages telling me Error in tidy.list(fit) : No tidying method recognized for this list
Any advice would be appreciated!
Here are some data added:
df <- data.frame(id=sample(LETTERS[1:10], replace= T, 10),
d1=rnorm(20), d2=rnorm(20), d3=rnorm(20), t= rnorm(20), stringsAsFactors = F)
fit1 = plyr::dlply(df, "id", function(x) lm(t ~ d1 + d2 + d3, data = x))
Upvotes: 2
Views: 101
Reputation: 16842
Here's a workflow based on the Many Models article, using broom
functions on nested data frames. Note that I increased the size of the data frame, just to have more observations to work with.
With this method, you group data by ID, nest it, and create the lm
objects by mapping over the nested data. broom::glance
gives you summary output on the models:
library(tidyverse)
library(broom)
set.seed(124)
df <- data.frame(id=sample(LETTERS[1:10], size = 10, replace = T),
d1=rnorm(100), d2=rnorm(100), d3=rnorm(100), t= rnorm(100), stringsAsFactors = F)
models_df <- df %>%
group_by(id) %>%
nest() %>%
mutate(mod = map(data, function(x) lm(t ~ d1 + d2 + d3, data = x))) %>%
mutate(glance = map(mod, glance)) %>%
unnest(glance)
models_df
#> # A tibble: 6 x 14
#> id data mod r.squared adj.r.squared sigma statistic p.value df
#> <chr> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 A <tibb… <S3:… 0.495 0.243 0.955 1.96 0.221 4
#> 2 E <tibb… <S3:… 0.142 -0.0189 0.907 0.883 0.471 4
#> 3 F <tibb… <S3:… 0.179 0.0247 1.05 1.16 0.355 4
#> 4 D <tibb… <S3:… 0.818 0.727 0.522 8.97 0.0123 4
#> 5 C <tibb… <S3:… 0.0514 -0.0580 0.994 0.470 0.706 4
#> 6 J <tibb… <S3:… 0.202 -0.197 0.933 0.506 0.693 4
#> # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> # deviance <dbl>, df.residual <int>
Now that you have a data frame with summary statistics like R-squared, you can use regular dplyr
commands like top_n
to get rankings:
# smallest 5 p-value
models_df %>% top_n(-5, p.value)
#> # A tibble: 5 x 14
#> id data mod r.squared adj.r.squared sigma statistic p.value df
#> <chr> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 A <tibb… <S3:… 0.495 0.243 0.955 1.96 0.221 4
#> 2 E <tibb… <S3:… 0.142 -0.0189 0.907 0.883 0.471 4
#> 3 F <tibb… <S3:… 0.179 0.0247 1.05 1.16 0.355 4
#> 4 D <tibb… <S3:… 0.818 0.727 0.522 8.97 0.0123 4
#> 5 J <tibb… <S3:… 0.202 -0.197 0.933 0.506 0.693 4
#> # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> # deviance <dbl>, df.residual <int>
# top r-sq
models_df %>% top_n(1, r.squared)
#> # A tibble: 1 x 14
#> id data mod r.squared adj.r.squared sigma statistic p.value df
#> <chr> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 D <tibb… <S3:… 0.818 0.727 0.522 8.97 0.0123 4
#> # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> # deviance <dbl>, df.residual <int>
For significant d3
terms, I went back to the models data frame and mapped broom::tidy
across the models, which gives summary stats on each term, then filtered to study the d3
terms.
# significant d3--assuming alpha = 0.05
models_df %>%
mutate(tidied = map(mod, tidy)) %>%
unnest(tidied, .drop = T) %>%
filter(term == "d3") %>%
filter(p.value < 0.05)
#> # A tibble: 1 x 17
#> id r.squared adj.r.squared sigma statistic p.value df logLik AIC
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 D 0.818 0.727 0.522 8.97 0.0123 4 -5.13 20.3
#> # ... with 8 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
#> # term <chr>, estimate <dbl>, std.error <dbl>, statistic1 <dbl>,
#> # p.value1 <dbl>
Created on 2018-06-20 by the reprex package (v0.2.0).
Upvotes: 2
Reputation: 31452
Here's an example of extracting the 3 highest R^2 models.
r2 = unlist(lapply(fit, function(x) summary(x)$r.squared))
fit[order(r2, decreasing = T)[1:3]]
The others are analagous: See this post https://stackoverflow.com/a/5587781/2761575 for how to extract p-values if you want to rank by these instead
Upvotes: 2