NightDog
NightDog

Reputation: 91

Extract models based on conditions in r

I ran about 30,000 multiple regression models using pylr packages.

fit = plyr::dlply(data, "id", function(x) lm(t ~ d1 + d2 + d3, data = x))

We have more than 30,000 IDs so it came out with more than 30,000 models. I cannot read the models one by one. Are there any efficient ways to look at the models based on the conditions. I want to look at models according to

top 5 p-value (from the smallest)
top r squred
all the models with d3 significant

At this moment, I can only look at the model one by one using l_ply(fit, summary, .print = TRUE)

I tried to use tidy function in broom packages in order to change the results into cleaned format so that i can export .csv file but it wouldn't work out.

test = broom::tidy(fit)

I received error messages telling me Error in tidy.list(fit) : No tidying method recognized for this list

Any advice would be appreciated!

Here are some data added:

df <- data.frame(id=sample(LETTERS[1:10], replace= T, 10),
             d1=rnorm(20), d2=rnorm(20), d3=rnorm(20), t= rnorm(20), stringsAsFactors = F)

fit1 = plyr::dlply(df, "id", function(x) lm(t ~ d1 + d2 + d3, data = x))

Upvotes: 2

Views: 101

Answers (2)

camille
camille

Reputation: 16842

Here's a workflow based on the Many Models article, using broom functions on nested data frames. Note that I increased the size of the data frame, just to have more observations to work with.

With this method, you group data by ID, nest it, and create the lm objects by mapping over the nested data. broom::glance gives you summary output on the models:

library(tidyverse)
library(broom)

set.seed(124)
df <- data.frame(id=sample(LETTERS[1:10], size = 10, replace = T),
                 d1=rnorm(100), d2=rnorm(100), d3=rnorm(100), t= rnorm(100), stringsAsFactors = F)

models_df <- df %>%
  group_by(id) %>%
  nest() %>%
  mutate(mod = map(data, function(x) lm(t ~ d1 + d2 + d3, data = x))) %>%
  mutate(glance = map(mod, glance)) %>%
  unnest(glance)

models_df
#> # A tibble: 6 x 14
#>   id    data   mod   r.squared adj.r.squared sigma statistic p.value    df
#>   <chr> <list> <lis>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>
#> 1 A     <tibb… <S3:…    0.495         0.243  0.955     1.96   0.221      4
#> 2 E     <tibb… <S3:…    0.142        -0.0189 0.907     0.883  0.471      4
#> 3 F     <tibb… <S3:…    0.179         0.0247 1.05      1.16   0.355      4
#> 4 D     <tibb… <S3:…    0.818         0.727  0.522     8.97   0.0123     4
#> 5 C     <tibb… <S3:…    0.0514       -0.0580 0.994     0.470  0.706      4
#> 6 J     <tibb… <S3:…    0.202        -0.197  0.933     0.506  0.693      4
#> # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>

Now that you have a data frame with summary statistics like R-squared, you can use regular dplyr commands like top_n to get rankings:

# smallest 5 p-value
models_df %>% top_n(-5, p.value)
#> # A tibble: 5 x 14
#>   id    data   mod   r.squared adj.r.squared sigma statistic p.value    df
#>   <chr> <list> <lis>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>
#> 1 A     <tibb… <S3:…     0.495        0.243  0.955     1.96   0.221      4
#> 2 E     <tibb… <S3:…     0.142       -0.0189 0.907     0.883  0.471      4
#> 3 F     <tibb… <S3:…     0.179        0.0247 1.05      1.16   0.355      4
#> 4 D     <tibb… <S3:…     0.818        0.727  0.522     8.97   0.0123     4
#> 5 J     <tibb… <S3:…     0.202       -0.197  0.933     0.506  0.693      4
#> # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>

# top r-sq
models_df %>% top_n(1, r.squared)
#> # A tibble: 1 x 14
#>   id    data   mod   r.squared adj.r.squared sigma statistic p.value    df
#>   <chr> <list> <lis>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>
#> 1 D     <tibb… <S3:…     0.818         0.727 0.522      8.97  0.0123     4
#> # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>

For significant d3 terms, I went back to the models data frame and mapped broom::tidy across the models, which gives summary stats on each term, then filtered to study the d3 terms.

# significant d3--assuming alpha = 0.05
models_df %>%
  mutate(tidied = map(mod, tidy)) %>%
  unnest(tidied, .drop = T) %>%
  filter(term == "d3") %>%
  filter(p.value < 0.05)
#> # A tibble: 1 x 17
#>   id    r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#>   <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl>
#> 1 D         0.818         0.727 0.522      8.97  0.0123     4  -5.13  20.3
#> # ... with 8 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
#> #   term <chr>, estimate <dbl>, std.error <dbl>, statistic1 <dbl>,
#> #   p.value1 <dbl>

Created on 2018-06-20 by the reprex package (v0.2.0).

Upvotes: 2

dww
dww

Reputation: 31452

Here's an example of extracting the 3 highest R^2 models.

r2 = unlist(lapply(fit, function(x) summary(x)$r.squared))
fit[order(r2, decreasing = T)[1:3]]

The others are analagous: See this post https://stackoverflow.com/a/5587781/2761575 for how to extract p-values if you want to rank by these instead

Upvotes: 2

Related Questions