micah
micah

Reputation: 1248

dplyr get linear regression coefficients

I'm wondering if there is a better way is to get linear regression coefficients as columns in dplyr. Here is some sample data.

mydata <- 
  data.frame(
    Site = c(1,1,1,1,1,1,1,1),
    Site1 = c(2,3,2,3,2,3,2,3),
    Age = c(17, 52, 19, 18, 62, 53, 41, 24),
    Gender = c(1,2,1,1,2,2,2,1),
    Outcome = c(1,1,1,1,0,0,0,1)
  ) 

I wrote this helper function to turn summary(.data)$coefficients into columns

GetCoefficients <- function(.data){
  AllData <- data.frame()
  AllData[1, ] <- ""
  col_names <- colnames(summary(.data)$coefficients)
  row_names <- rownames(summary(.data)$coefficients)
  row_len <- length(row_names)
  col_len <- length(col_names)-1
  x <- summary(.data)$coefficients
  for (i in 1:length(x)){
    AllData <- AllData %>%
      mutate(!!paste0(row_names[ifelse(i%%row_len != 0, i%%row_len, row_len)],
                      "_",col_names[ceiling(i/col_len)]) := x[i])
  }
  return(AllData)
} 

Using the helper function I can put coefficients into my data.frame()

Linear_regression <- mydata %>%
  pivot_longer(starts_with("Site"),
               names_to = ".value",
               names_pattern = "(^Site)") %>% 
  group_by(Site) %>% 
  do(Reg = lm(Outcome ~ Age + Gender, data = .)) %>%
  mutate(rsq = summary(Reg)$r.squared) %>% 
  mutate(fun = GetCoefficients(Reg))

Upvotes: 3

Views: 1462

Answers (2)

dipetkov
dipetkov

Reputation: 3700

The function broom::tidy extracts the coefficient estimates as well as their standard errors, t-statistics & p-values into rows.

library("tidyverse")

mydata <- tibble::tribble(
  ~Age, ~Gender, ~Outcome, ~Site,
  17, 1, 1, "1",
  17, 1, 1, "2",
  52, 2, 1, "1",
  52, 2, 1, "3",
  19, 1, 1, "1",
  19, 1, 1, "2",
  18, 1, 1, "1",
  18, 1, 1, "3",
  62, 2, 0, "1",
  62, 2, 0, "2",
  53, 2, 0, "1",
  53, 2, 0, "3",
  41, 2, 0, "1",
  41, 2, 0, "2",
  24, 1, 1, "1",
  24, 1, 1, "3"
)

mydata %>%
  group_by(
    Site
  ) %>%
  group_modify(
    ~ broom::tidy(lm(Outcome ~ Age + Gender, data = .))
  )
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> # A tibble: 9 × 6
#> # Groups:   Site [3]
#>   Site  term         estimate std.error statistic  p.value
#>   <chr> <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#> 1 1     (Intercept)  1.75e+ 0  5.37e- 1  3.26e+ 0 2.25e- 2
#> 2 1     Age         -9.13e-18  2.44e- 2 -3.74e-16 1   e+ 0
#> 3 1     Gender      -7.50e- 1  8.40e- 1 -8.92e- 1 4.13e- 1
#> 4 2     (Intercept)  2   e+ 0  4.20e-16  4.76e+15 1.34e-16
#> 5 2     Age          9.08e-18  1.49e-17  6.10e- 1 6.51e- 1
#> 6 2     Gender      -1   e+ 0  5.46e-16 -1.83e+15 3.48e-16
#> 7 3     (Intercept)  1.22e+ 0  2.03e+ 0  6.00e- 1 6.56e- 1
#> 8 3     Age         -2.70e- 2  1.62e- 1 -1.67e- 1 8.95e- 1
#> 9 3     Gender       3.51e- 1  5.16e+ 0  6.82e- 2 9.57e- 1

Sometimes however, we would like to extract only the coefficient estimates into columns.

mydata %>%
  group_by(
    Site
  ) %>%
  group_modify(
    ~ bind_rows(coefficients(lm(Outcome ~ Age + Gender, data = .)))
  )
#> # A tibble: 3 × 4
#> # Groups:   Site [3]
#>   Site  `(Intercept)`       Age Gender
#>   <chr>         <dbl>     <dbl>  <dbl>
#> 1 1              1.75 -9.13e-18 -0.750
#> 2 2              2     9.08e-18 -1    
#> 3 3              1.22 -2.70e- 2  0.351

Created on 2022-04-20 by the reprex package (v2.0.1)

Upvotes: 1

TarJae
TarJae

Reputation: 79194

Here is a combination of tidyverse and broom package to get your desired output.

Very handy here is group_split -> you get a list and then you iterate with purrrs map_dfr (by the way with map_dfr you get a dataframe otherwise with map you get a list) your regression lm(... through each list element. Using brooms glance gives the desired output:

library(tidyverse)
library(broom)

mydata %>%
  pivot_longer(starts_with("Site"),
               names_to = ".value",
               names_pattern = "(^Site)") %>% 
  mutate(Site=as.factor(Site)) %>% 
  group_by(Site) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(Outcome ~ Age+Gender, data=df) %>% 
      glance() %>% 
      add_column(Site = unique(df$Site), .before = 1)
  })
  Site  r.squared adj.r.squared    sigma statistic  p.value    df logLik    AIC     BIC deviance df.residual  nobs
  <fct>     <dbl>         <dbl>    <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>   <dbl>    <dbl>       <int> <int>
1 1         0.6           0.44  3.87e- 1  3.75e+ 0 1.01e- 1     2  -1.88   11.8   12.1  7.5 e- 1           5     8
2 2         1             1     2.22e-16  1.01e+31 2.22e-16     2 141.   -275.  -277.   4.93e-32           1     4
3 3         0.351        -0.946 6.97e- 1  2.71e- 1 8.05e- 1     2  -1.46   10.9    8.47 4.86e- 1           1     4

Upvotes: 6

Related Questions