Reputation: 3310
I'm using the R package stargazer to create high-quality regression tables, and I would like to use it to create a summary statistics table. I have a factor variable in my data, and I would like the summary table to show me the percent in each category of the factor -- in effect, separate the factor into a set of mutually exclusive logical (dummy) variables, and then display those in the table. Here's an example:
> library(car)
> library(stargazer)
> data(Blackmore)
> stargazer(Blackmore[, c("age", "exercise", "group")], type = "text")
==========================================
Statistic  N   Mean  St. Dev.  Min   Max
------------------------------------------
age       945 11.442  2.766   8.000 17.920
exercise  945 2.531   3.495   0.000 29.960
------------------------------------------
But I'm trying to get an additional row that shows me the percent in each group (% control and/or % patient, in these data). I'm sure this is just an option somewhere in stargazer, but I can't find it. Does anyone know what it is?
Edit: car::Blackmoor has updated spelling to car::Blackmore.
Upvotes: 6
Views: 7775
Reputation: 311
This has been a struggle for me. I like how stargazer looks, but I don't like that it doesn't produce summary statistics for each level of a factor variable. This worked for me; hopefully it saves someone a headache in the future.
You have to create the dummy variables yourself first. I use the fastDummies package for that. You also need two character vectors of column names: one for the variables that are factors and one for those that are not.
library('stargazer')
library('fastDummies')
factor_cols <- c("x", "y", "z")
nonfactor_cols <- c("u", "v")
df <- dummy_cols(df[, c(factor_cols, nonfactor_cols)]) # Adds one 0/1 column per factor level.
df <- df[, !names(df) %in% factor_cols] # Drop the original factor columns, which dummy_cols() keeps by default.
stargazer(df,
type = "html",
out = "summary.htm")
Note that the variable labels become messed up in the final output. But I usually change covariate names manually at the end anyway, so it is fine.
Upvotes: 0
Reputation: 3310
Another workaround is to use model.matrix to create dummy variables in a separate step, and then use stargazer to create a table from that. To show this with the example:
> library(car)
> library(stargazer)
> data(Blackmore)
>
> options(na.action = "na.pass") # so that we keep missing values in the data
> X <- model.matrix(~ age + exercise + group - 1, data = Blackmore)
> X.df <- data.frame(X) # stargazer only does summary tables of data.frame objects
> names(X.df) <- colnames(X) # keep the column names produced by model.matrix
> stargazer(X.df, type = "text")
=============================================
Statistic     N   Mean  St. Dev.  Min   Max
---------------------------------------------
age          945 11.442  2.766   8.000 17.920
exercise     945 2.531   3.495   0.000 29.960
groupcontrol 945 0.380   0.486     0     1
grouppatient 945 0.620   0.486     0     1
---------------------------------------------
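One caveat if you have more than one factor: with the default treatment contrasts, `- 1` only gives a full set of dummies for the first factor in the formula, and every later factor loses its reference level. A minimal sketch of one way around this, using only base R and a made-up data frame (`df`, `score`, and `site` are hypothetical names, not part of Blackmore), is to pass identity contrasts through `contrasts.arg`:

```r
# Hypothetical toy data with two factors (not the Blackmore data)
df <- data.frame(
  score = c(1.5, 2.0, 3.5, 4.0, 2.5, 3.0),
  group = factor(c("control", "patient", "patient", "control", "patient", "patient")),
  site  = factor(c("a", "b", "c", "a", "b", "c"))
)

# Build an identity ("no contrasts") matrix for every factor column,
# so that each level gets its own 0/1 indicator
factor_names <- names(df)[sapply(df, is.factor)]
full_contrasts <- lapply(df[factor_names], contrasts, contrasts = FALSE)

X <- model.matrix(~ . - 1, data = df, contrasts.arg = full_contrasts)
X.df <- as.data.frame(X)

# Column means of the dummies are the proportions in each category
colMeans(X.df)
```

The resulting `X.df` can then be handed to stargazer exactly as above. The matrix is rank deficient, which would matter for a regression but not for a summary table.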
Edit: car::Blackmoor has updated spelling to car::Blackmore.
Upvotes: 2
Reputation: 2988
The package tables can be useful for this task.
library(car)
library(tables)
data(Blackmore)
# percent only:
(x <- tabular((Factor(group, "") ) ~ (Pct=Percent()) * Format(digits=4),
data=Blackmore))
##
## Pct
## control 37.99
## patient 62.01
# percent and counts:
(x <- tabular((Factor(group, "") ) ~ ((n=1) + (Pct=Percent())) * Format(digits=4),
data=Blackmore))
##
## n Pct
## control 359.00 37.99
## patient 586.00 62.01
Then it's straightforward to output this to LaTeX:
> latex(x)
\begin{tabular}{lcc}
\hline
& n & \multicolumn{1}{c}{Pct} \\
\hline
control & $359.00$ & $\phantom{0}37.99$ \\
patient & $586.00$ & $\phantom{0}62.01$ \\
\hline
\end{tabular}
Upvotes: 1
Reputation: 38689
Since Stargazer can't do this directly, you can create your own summary table as a data frame and output that using pander, xtable, or any other package. For example, here's how you can use dplyr and tidyr to create a summary table:
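library(car) # provides the Blackmore data (called Blackmoor in older versions of car)
library(dplyr)
library(tidyr)
data(Blackmore)
fancy.summary <- Blackmore %>%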
select(-subject) %>% # Remove the subject column
group_by(group) %>% # Group by patient and control
# Note: summarise_each() and funs() are deprecated in recent dplyr releases
summarise_each(funs(mean, sd, min, max, length)) %>% # Calculate summary statistics for each group
mutate(prop = age_length / sum(age_length)) %>% # Calculate proportion
gather(variable, value, -group, -prop) %>% # Convert to long
separate(variable, c("variable", "statistic")) %>% # Split variable column
mutate(statistic = ifelse(statistic == "length", "n", statistic)) %>%
spread(statistic, value) %>% # Make the statistics be actual columns
select(group, variable, n, mean, sd, min, max, prop) # Reorder columns
Which results in this if you use pander:
library(pander)
pandoc.table(fancy.summary)
------------------------------------------------------
 group   variable   n   mean   sd    min   max   prop
------- ---------- --- ------ ----- ----- ----- ------
control    age     359 11.26  2.698   8   17.92 0.3799
control  exercise  359 1.641  1.813   0   11.54 0.3799
patient    age     586 11.55  2.802   8   17.92 0.6201
patient  exercise  586 3.076  4.113   0   29.96 0.6201
------------------------------------------------------
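If you are on a recent dplyr/tidyr, the same idea can probably be written with `across()`, `pivot_longer()`, and `pivot_wider()` instead of the older verbs. This is only a sketch under assumptions: `toy` is a small made-up data frame standing in for Blackmore without its subject column, and `modern.summary` is a name I picked for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for Blackmore (minus the subject column)
toy <- data.frame(
  group    = factor(c("control", "control", "patient", "patient", "patient")),
  age      = c(8, 10, 9, 12, 15),
  exercise = c(1, 3, 2, 4, 9)
)

modern.summary <- toy %>%
  group_by(group) %>%
  # One column per variable/statistic pair, named like age_mean, age_sd, ...
  summarise(
    across(everything(),
           list(mean = mean, sd = sd, min = min, max = max, n = length)),
    .groups = "drop"
  ) %>%
  mutate(prop = age_n / sum(age_n)) %>%            # Proportion in each group
  pivot_longer(-c(group, prop),                    # Convert to long
               names_to = c("variable", "statistic"),
               names_sep = "_") %>%
  pivot_wider(names_from = statistic,              # Statistics become columns
              values_from = value) %>%
  select(group, variable, n, mean, sd, min, max, prop)

modern.summary
```

Splitting on `"_"` assumes that none of the original column names contain an underscore; if yours do, a different separator or `names_pattern` would be needed.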
Upvotes: 5