Jeremy K.

Reputation: 1792

Summary Statistics table with factors and continuous variables

I am trying to create a simple summary statistics table (min, max, mean, n, etc.) that handles both factor variables and continuous variables, even when there is more than one factor variable. I'd like good-looking HTML output, e.g. from stargazer or huxtable.

For a simple reproducible example, I'll use mtcars but change two of the variables to factors, and simplify to three variables.

library(tidyverse)
library(stargazer)

mtcars_df <- mtcars
mtcars_df <- mtcars_df %>% 
  mutate(vs = factor(vs),
         am = factor(am)) %>% 
  select(mpg, vs, am)
head(mtcars_df)

So the data has two factor variables, vs and am. mpg is left as a double:

#>     mpg     vs     am
#>   <dbl> <fctr> <fctr>
#> 1  21.0      0      1
#> 2  21.0      0      1
#> 3  22.8      1      1
#> 4  21.4      1      0
#> 5  18.7      0      0
#> 6  18.1      1      0

My desired output would look something like this (format only, the numbers aren't all correct for am0):

======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
vs0       32 0.562   0.504    0     0        1      1 
vs1       32 0.438   0.504    0     0        1      1 
am0       32 0.594   0.499    0     0        1      1 
am1       32 0.406   0.499    0     0        1      1 
------------------------------------------------------

A straight call to stargazer silently drops the factor columns (a workaround that summarises one factor follows further down):

# this doesn't give factors
stargazer(mtcars_df, type = "text")
======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
------------------------------------------------------

This previous answer from @jake-fisher works very well to summarise one factor variable. https://stackoverflow.com/a/26935270/8742237

The code below, taken from that answer, reports both levels of the first factor, vs (i.e. vs0 and vs1), but for the second factor, am, it reports summary statistics for only one level (am1).

I realise this happens because model.matrix() drops a reference level to avoid the dummy variable trap when modelling. My issue is not about modelling, though: I want a summary table that lists every level of every factor variable.

options(na.action = "na.pass")  # so that we keep missing values in the data
X <- model.matrix(~ . - 1, data = mtcars_df)
X.df <- data.frame(X)  # stargazer only does summary tables of data.frame objects
stargazer(X.df, type = "text")

======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
vs0       32 0.562   0.504    0     0        1      1 
vs1       32 0.438   0.504    0     0        1      1 
am1       32 0.406   0.499    0     0        1      1 
------------------------------------------------------

stargazer or huxtable would be preferred, but if there's an easier way to produce this sort of summary table with a different library, that would still be very helpful.

Upvotes: 0

Views: 1007

Answers (1)

Jeremy K.

Reputation: 1792

In the end, instead of using model.matrix(), which drops the reference level of every factor after the first when it creates dummy variables, a simple fix is to use mlr::createDummyFeatures(), which creates a dummy column for every level, including the reference level.

library(tidyverse)
library(stargazer)
library(mlr)

mtcars_df <- mtcars
mtcars_df <- mtcars_df %>% 
  mutate(vs = factor(vs),
         am = factor(am)) %>% 
  select(mpg, vs, am)
head(mtcars_df)


X <- mlr::createDummyFeatures(obj = mtcars_df)
X.df <- data.frame(X)  # stargazer only does summary tables of data.frame objects
stargazer(X.df, type = "text")

This gives the desired output, with the levels labelled vs.0, vs.1, am.0 and am.1:

======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
vs.0      32 0.562   0.504    0     0        1      1 
vs.1      32 0.438   0.504    0     0        1      1 
am.0      32 0.594   0.499    0     0        1      1 
am.1      32 0.406   0.499    0     0        1      1 
------------------------------------------------------
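
If adding mlr as a dependency is not an option, the same table can probably be built with base R alone. The sketch below is an alternative to the approach above, not part of the original fix: it keeps model.matrix() but passes full indicator contrasts for each factor through contrasts.arg, so no level is dropped. The helper names is_fct and full_contrasts are made up for illustration, and the example data is rebuilt with base R so the block runs on its own.

```r
# Rebuild the example data with base R only (no tidyverse needed here).
mtcars_df <- mtcars[, c("mpg", "vs", "am")]
mtcars_df$vs <- factor(mtcars_df$vs)
mtcars_df$am <- factor(mtcars_df$am)

# Illustrative helper names (not from the original post).
# For every factor column, ask for a full indicator matrix instead of the
# default treatment contrasts, so that no level is dropped.
is_fct <- vapply(mtcars_df, is.factor, logical(1))
full_contrasts <- lapply(mtcars_df[is_fct], contrasts, contrasts = FALSE)

# Same formula as in the question; only contrasts.arg is new.
X <- model.matrix(~ . - 1, data = mtcars_df, contrasts.arg = full_contrasts)
X.df <- data.frame(X)  # stargazer summarises data.frame objects

colnames(X.df)  # column names of the expanded data
colMeans(X.df)  # share of rows in each level (and the mean of mpg)
```

X.df can then be passed to stargazer(X.df, type = "text") as before, or with type = "html" for HTML output. The columns here keep the model.matrix() naming (vs0, am0) rather than the dotted names that createDummyFeatures() produces.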

Upvotes: 1
