Reputation: 438
I have a dataset like this :
id = rep(paste("id",1:7,sep=""),c(5,2,4,2,3,1,4))
county=rep(c("A","B","C","B","A","D","C"),c(5,2,4,2,3,1,4))
year =c(2011:2015,2012:2013,2011:2014,2013:2014,2011:2013,2014,2012:2015)
age= sample(30:50,size=length(id),replace=T)
race=sample(c("white","Black","Other"),size=length(id),replace=T)
df=data.frame(id=id,county=county,year=year,age=age,race=race)
id county year age race
1 id1 A 2011 32 white
2 id1 A 2012 48 Black
3 id1 A 2013 50 Other
4 id1 A 2014 37 white
5 id1 A 2015 32 white
6 id2 B 2012 48 Black
7 id2 B 2013 48 Other
8 id3 C 2011 40 Other
9 id3 C 2012 33 Other
10 id3 C 2013 42 white
11 id3 C 2014 33 Other
12 id4 B 2013 43 Other
13 id4 B 2014 33 Black
14 id5 A 2011 50 Black
15 id5 A 2012 43 Other
16 id5 A 2013 41 white
17 id6 D 2014 37 Black
18 id7 C 2012 32 white
19 id7 C 2013 31 Other
20 id7 C 2014 34 Other
21 id7 C 2015 35 Other
I want group by county, year and get mean of age and percent of race categories. For age it can be done by
library(dplyr)
df %>% group_by(county,year) %>% summarise(avgage=mean(age))
But how to get percent of 3 categories of race by groups?
TIA!
Upvotes: 1
Views: 771
Reputation: 9878
First I would recommend using pivot_wider
to create a variable for every race.
Then we can use summarise
with mean
for the age mean. For the proportions of the races, use scales::percent(mean(variable))
across
all races.
library(tidyr)
library(dplyr)
df %>% pivot_wider(names_from = race,
values_from = race,
values_fn = \(x)1,
values_fill = 0)%>%
group_by(county, year)%>%
summarise(mean_age = mean(age),
across(Black:Other, ~scales::percent(mean(.x), 0.01)))
# A tibble: 14 × 6
# Groups: county [4]
county year mean_age Black white Other
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 A 2011 41 50.00% 0.00% 50.00%
2 A 2012 43 0.00% 100.00% 0.00%
3 A 2013 42 0.00% 0.00% 100.00%
4 A 2014 36 0.00% 0.00% 100.00%
5 A 2015 49 0.00% 100.00% 0.00%
6 B 2012 36 0.00% 0.00% 100.00%
7 B 2013 42 50.00% 0.00% 50.00%
8 B 2014 45 0.00% 0.00% 100.00%
9 C 2011 41 100.00% 0.00% 0.00%
10 C 2012 39 0.00% 50.00% 50.00%
11 C 2013 42 0.00% 50.00% 50.00%
12 C 2014 34.5 0.00% 50.00% 50.00%
13 C 2015 33 0.00% 0.00% 100.00%
14 D 2014 36 0.00% 0.00% 100.00%
We can shorten the first step by replacing pivot_wider
with a dummyfying function from another package. Here I used fastDummies
:
library(fastDummies)
library(dplyr)
df %>%
dummy_cols('race', remove_selected_columns = TRUE)%>%
group_by(county, year)%>%
summarise(mean_age = mean(age),
across(starts_with('race_'),
~scales::percent(mean(.x), 0.01)))
Upvotes: 1