Use dplyr to group-by dataset and summarize mean and SD (standard deviation)

Question

I have some python code that uses .groupby and .agg to convert a dataframe into a summary table, and am having trouble converting into R. My desired output looks like this:

Figure 1. Format for desired output

Imagine that the starting dataframe df looks like this:

Y1  Y2  Y3   Sex    X1  X2   X3    X4     X5    X6
1   0   1    Male   52  2    7.25  11.40  0.50  2
0   0   0    Female 42  1    2.00  27.00  1.00  2
1   0   1    Male   46  4    0.08  16.20  0.17  3
0   0   0    Female 60  3    5.65  2.00   1.68  1
1   0   1    Male   81  1    1.37  9.20   0.80  0
0   0   0    Female 44  2    0.87  15.40  1.00  0
1   0   1    Male   61  4    0.87  19.40  0.25  2
0   0   0    Female 46  1    2.00  7.20   1.00  1
1   0   1    Male   56  1    7.25  1.40   0.45  2
0   0   0    Female 54  2    2.00  25.20  1.00  3

I want to be able to convert df to look like Figure 1 in R. So far I've figured out I can use the dplyr package to groupby and summarize a dataset:

df %>%
  group_by(Sex) %>%
  summarize(
  m = mean(X1, na.rm=TRUE),
  sd = sd(X1)
)

However this only gives me the summary for the variable X1 and I need it to be grouped-by Y1, Y2, Y3 and the rest of the X variables.

So how can I code this so it looks like Figure 1?

FWIW, this is more or less the code I used in python but I need it for R.

Y1_ = df.groupby(['Y1','Sex']).agg(['mean','std']).round(2)
Y2_ = df.groupby(['Y2','Sex']).agg(['mean','std']).round(2)
Y3_ = df.groupby(['Y3','Sex']).agg(['mean','std']).round(2)

frames = [Y1_, Y2_, Y3_]
table1 = pd.concat(frames, keys=['Y1','Y2','Y3'], ignore_index=False)

neilfws · Accepted Answer

My answer is similar to that from @juhyeon but (1) I don't combine the Y values with Use and (2) I do some renaming to get output more like the example.

df %>% 
  gather(DepVar, Use, 1:3) %>% 
  mutate(Use = ifelse(Use == 0, "No", "Yes")) %>% 
  group_by(DepVar, Use, Sex) %>% 
  summarise_at(vars(starts_with("X")), list(mean = mean, sd = sd)) %>% 
  select(DepVar, Use, Sex,
         X1_mean, X1_sd, 
         X2_mean, X2_sd, 
         X3_mean, X3_sd, 
         X4_mean, X4_sd, 
         X5_mean, X5_sd, 
         X6_mean, X6_sd)

Result:

# A tibble: 6 x 15
# Groups:   DepVar, Use [5]
  DepVar Use   Sex    X1_mean X1_sd X2_mean X2_sd X3_mean X3_sd X4_mean X4_sd X5_mean X5_sd X6_mean X6_sd
                              
1 Y1     No    Female    49.2  7.56     1.8 0.837    2.50  1.83    15.4 10.9    1.14  0.304     1.4  1.14
2 Y1     Yes   Male      59.2 13.4      2.4 1.52     3.36  3.58    11.5  6.92   0.434 0.246     1.8  1.10
3 Y2     No    Female    49.2  7.56     1.8 0.837    2.50  1.83    15.4 10.9    1.14  0.304     1.4  1.14
4 Y2     No    Male      59.2 13.4      2.4 1.52     3.36  3.58    11.5  6.92   0.434 0.246     1.8  1.10
5 Y3     No    Female    49.2  7.56     1.8 0.837    2.50  1.83    15.4 10.9    1.14  0.304     1.4  1.14
6 Y3     Yes   Male      59.2 13.4      2.4 1.52     3.36  3.58    11.5  6.92   0.434 0.246     1.8  1.10

Use dplyr to group-by dataset and summarize mean and SD (standard deviation)

Answers (2)

Related Questions