Reputation: 75
I have this dataframe:
   treatment  hh_id hh_size   sex   yob g2000 g2002 g2004 p2000
   <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Civic Duty     1       2     1  1941     1     1     1     0
 2 Civic Duty     1       2     1  1947     1     1     1     0
 3 Hawthorne      2       3     1  1951     1     1     1     0
 4 Hawthorne      2       3     1  1950     1     1     1     0
 5 Hawthorne      2       3     1  1982     1     1     1     0
 6 Control        3       3     1  1981     0     0     1     0
 7 Control        3       3     1  1959     1     1     1     0
 8 Control        3       3     1  1956     1     1     1     0
 9 Control        4       2     1  1968     0     0     1     0
10 Control        4       2     1  1967     1     1     1     0
I want to group it by hh_id and treatment and summarize the rest of the columns by their mean.
Except, I also want two other columns to count the number of males and females in each household, where in the "sex" column female == 1 and male == 0.
Here's what I have so far:
households <- df %>%
  mutate_if(is.character, factor) %>%
  group_by(hh_id, treatment) %>%
  summarise_if(is.numeric, mean)
View(households)
which gives me this dataframe:
   hh_id treatment  hh_size   sex   yob g2000 g2002 g2004 p2000
   <dbl> <fct>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1 Civic Duty       2     1  1944     1     1     1     0
 2     2 Hawthorne        3     1  1961     1     1     1     0
 3     3 Control          3     1 1965. 0.667 0.667     1     0
 4     4 Control          2     1 1968.   0.5   0.5     1     0
 5     5 Control          1     1  1941     1     1     1     0
 6     6 Hawthorne        2     1  1947     1     1     1     0
 7     7 Control          1     1  1969     1     0     1     0
 8     8 Control          2     1  1964     1     1     1   0.5
 9     9 Self             2     1  1956   0.5   0.5     1     0
10    10 Control          1     1  1943     1     1     1     0
Upvotes: 0
Views: 495
Reputation: 886948
Instead of summarise_if, use summarise with across (which is much more flexible). Also, the scoped _if/_at/_all variants have been superseded by across since dplyr 1.0.0.
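library(dplyr)
df1 %>%
  group_by(hh_id, treatment) %>%
  # Count first: expressions run in order, so after across() `sex` would
  # already be the group mean and sum(sex == 1) would no longer count people.
  summarise(n_female = sum(sex == 1), n_male = sum(sex == 0),
            across(where(is.numeric), mean), .groups = "drop")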
The flexibility is that we can pass multiple sets of columns with different functions in separate across calls, and also compute on a single column without across.
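As a rough sketch of that flexibility, the snippet below mixes plain single-column summaries with two across calls that apply different functions to different column sets. The data frame `toy` and its values are made up for illustration and are not the asker's data; the output names `yob_min` and `yob_max` come from the `.names` pattern chosen here.

```r
library(dplyr)

# Toy data, invented for illustration only (not the data from the question)
toy <- data.frame(
  hh_id     = c(1L, 1L, 2L, 2L, 2L),
  treatment = c("Civic Duty", "Civic Duty", "Control", "Control", "Control"),
  sex       = c(1L, 0L, 1L, 1L, 0L),
  yob       = c(1941L, 1947L, 1981L, 1959L, 1956L),
  g2000     = c(1L, 1L, 0L, 1L, 1L)
)

res <- toy %>%
  group_by(hh_id, treatment) %>%
  summarise(
    # single-column computations without across(); `sex` is still the
    # raw per-person vector at this point
    n_female = sum(sex == 1),
    n_male   = sum(sex == 0),
    # one set of columns summarised with one function ...
    across(starts_with("g"), mean),
    # ... and another column summarised with two functions
    across(yob, list(min = min, max = max), .names = "{.col}_{.fn}"),
    .groups = "drop"
  )

print(res)
```

Keeping the counts ahead of any across call that touches `sex` is a deliberate choice in this sketch, since later expressions in summarise see the results of earlier ones.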
df1 <- structure(list(treatment = c("Civic Duty", "Civic Duty", "Hawthorne",
"Hawthorne", "Hawthorne", "Control", "Control", "Control", "Control",
"Control"), hh_id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
hh_size = c(2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L), sex = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), yob = c(1941L, 1947L,
1951L, 1950L, 1982L, 1981L, 1959L, 1956L, 1968L, 1967L),
g2000 = c(1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L), g2002 = c(1L,
1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L), g2004 = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), p2000 = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
Upvotes: 1