Reputation: 1926
I am trying to mimic the table
Stata command in R, which performs summary statistics tables. The command allows you to create cross tables with diverse statistics inside of the resulting cells. For instance, in my example below, I am crossing three variables (category1
, category2
, and category3
) and getting as a column vector the mean and standard deviation of metric1
and the mean and standard deviation metric2
.
The stated behavior is obtained with the following single line on Stata.
table category1 category2 category3 ,c(mean metric1 sd metric1 mean metric2 sd metric2)
Here each column vector of the resulting cross table, lets say X
of the cross table contains X = [mean(metric1),sd(metric1), mean(metric2),sd(metric2)]'
----------------------------------------------------------------------------
| category3 and category2
| ------------ First ----------- ----------- Second -----------
category1 | A B C Total A B C Total
----------+-----------------------------------------------------------------
1 | mean(metric1)
| sd(metric1)
| mean(metric2)
| sd(metric1)
----------------------------------------------------------------------------
| category3 and category2
| ------------ First ----------- ----------- Second -----------
category1 | A B C Total A B C Total
----------+-----------------------------------------------------------------
1 | 5.778 7.200 2.571 5.048 6.667 3.000 3.000 4.222
| 2.906 3.347 2.507 3.324 2.309 1.414 1.155 2.333
| -1.556 -2.000 -1.143 -1.524 -2.000 -2.000 -3.000 -2.444
| 1.667 0.000 1.069 1.250 0.000 2.828 1.155 1.333
|
2 | 3.200 6.333 4.200 4.571 4.889 5.000 5.000 4.947
| 2.280 3.445 2.741 2.976 3.180 3.464 2.449 2.857
| -0.800 -2.000 -2.000 -1.714 -2.222 -1.500 -1.000 -1.684
| 1.095 1.265 1.333 1.309 1.563 1.000 1.673 1.529
|
3 | 8.667 4.667 5.167 5.667 5.667 6.667 6.000 6.000
| 2.309 2.309 2.758 2.849 3.445 4.163 3.464 3.303
| -3.333 -2.667 -2.000 -2.333 -2.333 -2.000 -1.333 -2.000
| 1.155 1.155 1.477 1.414 0.816 2.000 1.155 1.206
|
Total | 5.529 6.286 4.207 5.067 5.444 5.111 4.615 5.100
| 3.125 3.124 2.795 3.047 3.053 3.333 2.501 2.898
| -1.647 -2.143 -1.793 -1.833 -2.222 -1.778 -1.692 -1.950
| 1.618 0.949 1.346 1.342 1.166 1.563 1.601 1.395
----------------------------------------------------------------------------
clear all
set obs 100
set seed 777
gen category1 = runiformint(1,3)
gen category2_num = runiformint(1,3)
gen category2 = "A" if category2_num ==1
replace category2 = "B" if category2_num ==2
replace category2 = "C" if category2_num ==3
drop category2_num
gen category3_num = runiformint(1,2)
gen category3 = "First" if category3_num ==1
replace category3 = "Second" if category3_num ==2
drop category3_num
gen metric1 = round(runiform()*10,2)
gen metric2 = round(runiform()*-4,2)
table category1 category2 category3 /// List of the variables that will create the crosstab
,c(mean metric1 sd metric1 /// Mean and std.dev of metric1 as 1st and 2nd rows
mean metric2 sd metric2) /// Mean and std.dev of metric2 as 3rd and 4th rows
row col /// Add the over all statistics total rows and cols
format(%9.3f) // Decimal style setting.
Here is how I have tackled the problem. However, I am still far from my desired results. Even though I have the same information displayed on the screen, the readability is very poor in the way I am presenting it on R. Additionally, I haven't computed the mean and standard deviation for the total of rows and columns and I did on the Stata output.
Finally, in my opinion, this procedure seems like an overkill solution for such a simple problem. In my context packages are allowed, hence, dplyr
or data.table
suggestions are welcome.
df <- as.data.frame(structure(list(category1 = structure(c(1, 3, 1, 2, 3, 1, 3, 1,1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 3, 1, 3, 3, 1, 3, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 3, 3, 2, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 1,3, 3, 3, 2, 2, 1, 1, 1, 3, 2, 3, 1, 2, 2, 1, 3, 1, 3, 1, 1, 3,1, 1, 2, 1, 3, 2, 2, 3, 3, 3, 1, 2, 3, 2, 3, 2, 1, 1, 1, 2, 2,2, 1, 3, 2, 2, 2, 3, 3), format.stata = "%9.0g"),
category2 = structure(c("C", "A", "A", "A", "C", "C", "A", "A", "A", "A", "B", "A", "A", "A","A", "B", "A", "C", "C", "B", "C", "A", "A", "C", "A", "B", "C", "B", "C", "C", "A", "C", "B", "B", "A", "B", "C", "A", "B", "B","C", "A", "A", "C", "C", "B", "C", "A", "A", "C", "C", "B", "C", "C", "A", "C", "A", "A", "C", "B", "A", "C", "C", "C", "B", "B","C", "C", "A", "A", "C", "C", "A", "C", "B", "B", "C", "C", "C", "C", "A", "C", "C", "C", "C", "B", "B", "B", "B", "C", "A", "A","C", "C", "A", "A", "A", "B", "B", "C"), format.stata = "%9s"),
category3 = structure(c("First", "Second", "First", "First", "First", "First", "Second", "Second", "First", "Second", "First", "First", "Second", "Second", "First", "Second", "Second", "First", "Second", "First", "First", "First", "First","Second", "First", "First", "Second", "First", "First", "First","First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First","Second", "Second", "First", "Second", "Second", "Second","First", "First", "First", "Second", "Second", "First", "First","Second", "First", "First", "First", "First", "Second", "First","Second", "Second", "First", "Second", "First", "Second", "First", "Second", "First", "First", "First", "First", "Second","First", "First", "First", "Second", "Second", "First", "First","First", "Second", "First", "Second", "First", "Second","Second", "First", "Second", "First", "First", "Second","Second", "Second", "Second", "First"), format.stata = "%9s"),
metric1 = structure(c(0, 10, 0, 0, 8, 4, 4, 8, 8, 2, 4, 4, 6, 2, 6, 8, 6, 4, 4, 10, 10, 4, 6, 8, 6, 2, 4, 4, 6, 0, 6,0, 10, 8, 2, 2, 2, 0, 2, 10, 2, 8, 4, 6, 8, 2, 2, 6, 0, 2,4, 6, 2, 2, 8, 6, 8, 8, 2, 8, 10, 4, 4, 4, 4, 10, 4, 2, 6,4, 6, 4, 10, 2, 8, 6, 8, 2, 6, 6, 6, 4, 8, 6, 8, 2, 10, 2, 6, 2, 10, 4, 8, 0, 10, 6, 4, 2, 8, 8), format.stata = "%9.0g"),
metric2 = structure(c(0, -4, 0, 0, -2, -2, -2, -2, -4, -2, -2, -2, -2, -4, 0, 0, -2, -2, -4, -2, 0, -2, -4, -2, -2, -2, -2, -2, -4, 0, -4, -4, -2, -2, -2, -2, -2, -2, -4, -2, -2, -2, -2, -2, 0, -2, -4, -4, -2, -2, 0, -4, -2, 0, -2,-2, 0, -2, -4, 0, -2, -2, 0, 0, -4, -4, 0, -2, 0, -2, -2, -4, 0, -2, -2, -2, 0, -2, -2, -2, -2, -2, -2, 0, 0, 0, -2, 0, -2, -4, 0, 0, 0, -2, -4, -4, 0, -2, -2, -4), format.stata = "%9.0g")),
row.names = c(NA,-100L), class = c("tbl_df", "tbl", "data.frame")))
# expand grid for every possible value
prs <- expand.grid(cat1 = unique(df$category1) ,
cat2 = unique(df$category2) ,
cat3 = unique(df$category3))
#Number of total combinations
N <- nrow(prs)
#Loop over the combinations to get the desired statistis
A <- lapply(1:N, FUN = function(i){
mean1 <- mean(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
sd1 <- sd(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
mean2 <- mean(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
sd2 <- sd(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
r_list<- list(cat1 = prs$cat1[i],cat2 = prs$cat2[i], cat3 = prs$cat3[i],
mean1 = mean1, sd1 = sd1 , mean2 = mean2, sd2 = sd2)
return(r_list)
})
#List to data.frame
df_stats <- do.call(rbind.data.frame, A)
# cat1 cat2 cat3 mean1 sd1 mean2 sd2
# 2 1 C First 2.571429 2.507133 -1.142857 1.0690450
# 21 3 C First 5.166667 2.757909 -2.000000 1.4770979
# 3 2 C First 4.200000 2.740641 -2.000000 1.3333333
# 4 1 A First 5.777778 2.905933 -1.555556 1.6666667
# 5 3 A First 8.666667 2.309401 -3.333333 1.1547005
# 6 2 A First 3.200000 2.280351 -0.800000 1.0954451
# 7 1 B First 7.200000 3.346640 -2.000000 0.0000000
# 8 3 B First 4.666667 2.309401 -2.666667 1.1547005
# 9 2 B First 6.333333 3.444803 -2.000000 1.2649111
# 10 1 C Second 3.000000 1.154701 -3.000000 1.1547005
# 11 3 C Second 6.000000 3.464102 -1.333333 1.1547005
# 12 2 C Second 5.000000 2.449490 -1.000000 1.6733201
# 13 1 A Second 6.666667 2.309401 -2.000000 0.0000000
# 14 3 A Second 5.666667 3.444803 -2.333333 0.8164966
# 15 2 A Second 4.888889 3.179797 -2.222222 1.5634719
# 16 1 B Second 3.000000 1.414214 -2.000000 2.8284271
# 17 3 B Second 6.666667 4.163332 -2.000000 2.0000000
# 18 2 B Second 5.000000 3.464102 -1.500000 1.0000000
Upvotes: 0
Views: 791
Reputation: 6489
You could use data.table
and magrittr
packages as follows:
library(magrittr)
library(data.table)
# function to compute the mean and sd
fun <- function(x, y) list(metric1_meam=mean(x), metric1_sd=sd(x), metric2_meam=mean(y), metric2_sd=sd(y))
# compute the Total column, and A,B,C columns of the desired output as follows and bind them
setDT(df)[, 'category1' := as.character(category1)]
Y <- rbind(
df[, fun(metric1, metric2), by=.(category1, category2, category3)],
df[, fun(metric1, metric2), by=.(category1, category3)][, category2 := 'Total'],
df[, fun(metric1, metric2), by=.(category2, category3)][, category1 := 'Total'],
df[, fun(metric1, metric2), by=.(category3)][, c('category1', 'category2') := 'Total']
)
# generate the desired output
melt(Y, measure=patterns('metric')) %>%
xtabs(formula = value ~ .) %>%
ftable(col.vars = c('category3', 'category2'))
category3 First Second
category2 A B C Total A B C Total
category1 variable
1 metric1_meam 5.7777778 7.2000000 2.5714286 5.0476190 6.6666667 3.0000000 3.0000000 4.2222222
metric1_sd 2.9059326 3.3466401 2.5071327 3.3237959 2.3094011 1.4142136 1.1547005 2.3333333
metric2_meam -1.5555556 -2.0000000 -1.1428571 -1.5238095 -2.0000000 -2.0000000 -3.0000000 -2.4444444
metric2_sd 1.6666667 0.0000000 1.0690450 1.2497619 0.0000000 2.8284271 1.1547005 1.3333333
2 metric1_meam 3.2000000 6.3333333 4.2000000 4.5714286 4.8888889 5.0000000 5.0000000 4.9473684
metric1_sd 2.2803509 3.4448028 2.7406406 2.9760952 3.1797973 3.4641016 2.4494897 2.8572264
metric2_meam -0.8000000 -2.0000000 -2.0000000 -1.7142857 -2.2222222 -1.5000000 -1.0000000 -1.6842105
metric2_sd 1.0954451 1.2649111 1.3333333 1.3093073 1.5634719 1.0000000 1.6733201 1.5294382
3 metric1_meam 8.6666667 4.6666667 5.1666667 5.6666667 5.6666667 6.6666667 6.0000000 6.0000000
metric1_sd 2.3094011 2.3094011 2.7579087 2.8491485 3.4448028 4.1633320 3.4641016 3.3028913
metric2_meam -3.3333333 -2.6666667 -2.0000000 -2.3333333 -2.3333333 -2.0000000 -1.3333333 -2.0000000
metric2_sd 1.1547005 1.1547005 1.4770979 1.4142136 0.8164966 2.0000000 1.1547005 1.2060454
Total metric1_meam 5.5294118 6.2857143 4.2068966 5.0666667 5.4444444 5.1111111 4.6153846 5.1000000
metric1_sd 3.1248529 3.1238185 2.7951400 3.0469027 3.0529103 3.3333333 2.5012817 2.8982753
metric2_meam -1.6470588 -2.1428571 -1.7931034 -1.8333333 -2.2222222 -1.7777778 -1.6923077 -1.9500000
metric2_sd 1.6179144 0.9492623 1.3464055 1.3424827 1.1659662 1.5634719 1.6012815 1.3950462
Upvotes: 1
Reputation: 72758
You may exploit the power of aggregate
.
FUN <- function(x) c(mean=mean(x), sd=sd(x))
aggregate(cbind(metric1, metric2) ~ ., df, FUN)
# category1 category2 category3 metric1.mean metric1.sd metric2.mean metric2.sd
# 1 1 A First 5.777778 2.905933 -1.5555556 1.6666667
# 2 2 A First 3.200000 2.280351 -0.8000000 1.0954451
# 3 3 A First 8.666667 2.309401 -3.3333333 1.1547005
# 4 1 B First 7.200000 3.346640 -2.0000000 0.0000000
# 5 2 B First 6.333333 3.444803 -2.0000000 1.2649111
# 6 3 B First 4.666667 2.309401 -2.6666667 1.1547005
# 7 1 C First 2.571429 2.507133 -1.1428571 1.0690450
# 8 2 C First 4.200000 2.740641 -2.0000000 1.3333333
# 9 3 C First 5.166667 2.757909 -2.0000000 1.4770979
# 10 1 A Second 6.666667 2.309401 -2.0000000 0.0000000
# 11 2 A Second 4.888889 3.179797 -2.2222222 1.5634719
# 12 3 A Second 5.666667 3.444803 -2.3333333 0.8164966
# 13 1 B Second 3.000000 1.414214 -2.0000000 2.8284271
# 14 2 B Second 5.000000 3.464102 -1.5000000 1.0000000
# 15 3 B Second 6.666667 4.163332 -2.0000000 2.0000000
# 16 1 C Second 3.000000 1.154701 -3.0000000 1.1547005
# 17 2 C Second 5.000000 2.449490 -1.0000000 1.6733201
# 18 3 C Second 6.000000 3.464102 -1.3333333 1.1547005
For cross tabulation try xtabs
.
aggregate
yields matrices in columns when multiple functions are applied(see this answer, why), so first we get rid of them.
r <- do.call(data.frame, aggregate(cbind(metric1, metric2) ~ ., df, FUN))
Now we can apply xtabs
, e.g. For each of the category3.
xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == "First", 1:5])
# , , category3 = First, = metric1.mean
#
# category2
# category1 A B C
# 1 5.777778 7.200000 2.571429
# 2 3.200000 6.333333 4.200000
# 3 8.666667 4.666667 5.166667
#
# , , category3 = First, = metric1.sd
#
# category2
# category1 A B C
# 1 2.905933 3.346640 2.507133
# 2 2.280351 3.444803 2.740641
# 3 2.309401 2.309401 2.757909
xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == "Second", 1:5])
# , , category3 = Second, = metric1.mean
#
# category2
# category1 A B C
# 1 6.666667 3.000000 3.000000
# 2 4.888889 5.000000 5.000000
# 3 5.666667 6.666667 6.000000
#
# , , category3 = Second, = metric1.sd
#
# category2
# category1 A B C
# 1 2.309401 1.414214 1.154701
# 2 3.179797 3.464102 2.449490
# 3 3.444803 4.163332 3.464102
Or use sapply
to do this in one step.
sapply(c("First", "Second"), function(c3)
xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == c3, 1:5]),
simplify="array")
# , , category3 = First, = metric1.mean, = First
#
# category2
# category1 A B C
# 1 5.777778 7.200000 2.571429
# 2 3.200000 6.333333 4.200000
# 3 8.666667 4.666667 5.166667
#
# , , category3 = First, = metric1.sd, = First
#
# category2
# category1 A B C
# 1 2.905933 3.346640 2.507133
# 2 2.280351 3.444803 2.740641
# 3 2.309401 2.309401 2.757909
#
# , , category3 = First, = metric1.mean, = Second
#
# category2
# category1 A B C
# 1 6.666667 3.000000 3
# 2 4.888889 5.000000 5
# 3 5.666667 6.666667 6
#
# , , category3 = First, = metric1.sd, = Second
#
# category2
# category1 A B C
# 1 2.309401 1.414214 1.154701
# 2 3.179797 3.464102 2.449490
# 3 3.444803 4.163332 3.464102
Upvotes: 0