Reputation: 4317
I have some python code that uses .groupby
and .agg
to convert a dataframe into a summary table, and am having trouble converting into R. My desired output looks like this:
Figure 1. Format for desired output
Imagine that the starting dataframe df
looks like this:
Y1 Y2 Y3 Sex X1 X2 X3 X4 X5 X6
1 0 1 Male 52 2 7.25 11.40 0.50 2
0 0 0 Female 42 1 2.00 27.00 1.00 2
1 0 1 Male 46 4 0.08 16.20 0.17 3
0 0 0 Female 60 3 5.65 2.00 1.68 1
1 0 1 Male 81 1 1.37 9.20 0.80 0
0 0 0 Female 44 2 0.87 15.40 1.00 0
1 0 1 Male 61 4 0.87 19.40 0.25 2
0 0 0 Female 46 1 2.00 7.20 1.00 1
1 0 1 Male 56 1 7.25 1.40 0.45 2
0 0 0 Female 54 2 2.00 25.20 1.00 3
I want to be able to convert df
to look like Figure 1 in R. So far I've figured out I can use the dplyr
package to groupby and summarize a dataset:
df %>%
group_by(Sex) %>%
summarize(
m = mean(X1, na.rm=TRUE),
sd = sd(X1)
)
However this only gives me the summary for the variable X1 and I need it to be grouped-by Y1, Y2, Y3 and the rest of the X variables.
So how can I code this so it looks like Figure 1?
FWIW, this is more or less the code I used in python but I need it for R.
Y1_ = df.groupby(['Y1','Sex']).agg(['mean','std']).round(2)
Y2_ = df.groupby(['Y2','Sex']).agg(['mean','std']).round(2)
Y3_ = df.groupby(['Y3','Sex']).agg(['mean','std']).round(2)
frames = [Y1_, Y2_, Y3_]
table1 = pd.concat(frames, keys=['Y1','Y2','Y3'], ignore_index=False)
Upvotes: 1
Views: 5511
Reputation: 33782
My answer is similar to that from @juhyeon but (1) I don't combine the Y values with Use
and (2) I do some renaming to get output more like the example.
df %>%
gather(DepVar, Use, 1:3) %>%
mutate(Use = ifelse(Use == 0, "No", "Yes")) %>%
group_by(DepVar, Use, Sex) %>%
summarise_at(vars(starts_with("X")), list(mean = mean, sd = sd)) %>%
select(DepVar, Use, Sex,
X1_mean, X1_sd,
X2_mean, X2_sd,
X3_mean, X3_sd,
X4_mean, X4_sd,
X5_mean, X5_sd,
X6_mean, X6_sd)
Result:
# A tibble: 6 x 15
# Groups: DepVar, Use [5]
DepVar Use Sex X1_mean X1_sd X2_mean X2_sd X3_mean X3_sd X4_mean X4_sd X5_mean X5_sd X6_mean X6_sd
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Y1 No Female 49.2 7.56 1.8 0.837 2.50 1.83 15.4 10.9 1.14 0.304 1.4 1.14
2 Y1 Yes Male 59.2 13.4 2.4 1.52 3.36 3.58 11.5 6.92 0.434 0.246 1.8 1.10
3 Y2 No Female 49.2 7.56 1.8 0.837 2.50 1.83 15.4 10.9 1.14 0.304 1.4 1.14
4 Y2 No Male 59.2 13.4 2.4 1.52 3.36 3.58 11.5 6.92 0.434 0.246 1.8 1.10
5 Y3 No Female 49.2 7.56 1.8 0.837 2.50 1.83 15.4 10.9 1.14 0.304 1.4 1.14
6 Y3 Yes Male 59.2 13.4 2.4 1.52 3.36 3.58 11.5 6.92 0.434 0.246 1.8 1.10
Upvotes: 1
Reputation: 456
This is your data.
db <- structure(list(Y1 = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0), Y2 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), Y3 = c(1, 0, 1, 0, 1, 0, 1, 0, 1,
0), Sex = c("Male", "Female", "Male", "Female", "Male", "Female",
"Male", "Female", "Male", "Female"), X1 = c(52, 42, 46, 60, 81,
44, 61, 46, 56, 54), X2 = c(2, 1, 4, 3, 1, 2, 4, 1, 1, 2), X3 = c(7.25,
2, 0.08, 5.65, 1.37, 0.87, 0.87, 2, 7.25, 2), X4 = c(11.4, 27,
16.2, 2, 9.2, 15.4, 19.4, 7.2, 1.4, 25.2), X5 = c(0.5, 1, 0.17,
1.68, 0.8, 1, 0.25, 1, 0.45, 1), X6 = c(2, 2, 3, 1, 0, 0, 2,
1, 2, 3)), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
This is the code. I thought the Y1, Y2, and Y3 should be changed from wide to long format. That's why I use gather function at first.
db_pro1 <- db %>%
gather(y, value, starts_with("Y")) %>%
mutate(y_value = paste0(y, "-" ,value)) %>%
group_by(y_value, Sex) %>%
summarise_at(vars(starts_with("X")), funs(mean = mean(.), sd = sd(.)))
# A tibble: 6 x 14
# Groups: y_value [5]
y_value Sex X1_mean X2_mean X3_mean X4_mean X5_mean X6_mean X1_sd X2_sd X3_sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Y1-0 Fema~ 49.2 1.8 2.50 15.4 1.14 1.4 7.56 0.837 1.83
2 Y1-1 Male 59.2 2.4 3.36 11.5 0.434 1.8 13.4 1.52 3.58
3 Y2-0 Fema~ 49.2 1.8 2.50 15.4 1.14 1.4 7.56 0.837 1.83
4 Y2-0 Male 59.2 2.4 3.36 11.5 0.434 1.8 13.4 1.52 3.58
5 Y3-0 Fema~ 49.2 1.8 2.50 15.4 1.14 1.4 7.56 0.837 1.83
6 Y3-1 Male 59.2 2.4 3.36 11.5 0.434 1.8 13.4 1.52 3.58
# ... with 3 more variables: X4_sd <dbl>, X5_sd <dbl>, X6_sd <dbl>
Upvotes: 2