Triparna Poddar
Triparna Poddar

Reputation: 438

Percentage of categorical variable grouping by different variables in R

I have a dataset like this :

id = rep(paste("id",1:7,sep=""),c(5,2,4,2,3,1,4))
county=rep(c("A","B","C","B","A","D","C"),c(5,2,4,2,3,1,4))
year =c(2011:2015,2012:2013,2011:2014,2013:2014,2011:2013,2014,2012:2015)
age= sample(30:50,size=length(id),replace=T)
race=sample(c("white","Black","Other"),size=length(id),replace=T)
df=data.frame(id=id,county=county,year=year,age=age,race=race)

    id county year age  race
1  id1      A 2011  32 white
2  id1      A 2012  48 Black
3  id1      A 2013  50 Other
4  id1      A 2014  37 white
5  id1      A 2015  32 white
6  id2      B 2012  48 Black
7  id2      B 2013  48 Other
8  id3      C 2011  40 Other
9  id3      C 2012  33 Other
10 id3      C 2013  42 white
11 id3      C 2014  33 Other
12 id4      B 2013  43 Other
13 id4      B 2014  33 Black
14 id5      A 2011  50 Black
15 id5      A 2012  43 Other
16 id5      A 2013  41 white
17 id6      D 2014  37 Black
18 id7      C 2012  32 white
19 id7      C 2013  31 Other
20 id7      C 2014  34 Other
21 id7      C 2015  35 Other

I want group by county, year and get mean of age and percent of race categories. For age it can be done by

library(dplyr)
df %>% group_by(county,year) %>% summarise(avgage=mean(age))

But how to get percent of 3 categories of race by groups?

TIA!

Upvotes: 1

Views: 771

Answers (1)

GuedesBF
GuedesBF

Reputation: 9878

First I would recommend using pivot_widerto create a variable for every race. Then we can use summarise with mean for the age mean. For the proportions of the races, use scales::percent(mean(variable)) across all races.

library(tidyr)
library(dplyr)

df %>% pivot_wider(names_from = race,
                   values_from = race,
                   values_fn = \(x)1,
                   values_fill = 0)%>%
    group_by(county, year)%>%
    summarise(mean_age = mean(age),
              across(Black:Other, ~scales::percent(mean(.x), 0.01)))

# A tibble: 14 × 6
# Groups:   county [4]
   county  year mean_age Black   white   Other  
   <chr>  <dbl>    <dbl> <chr>   <chr>   <chr>  
 1 A       2011     41   50.00%  0.00%   50.00% 
 2 A       2012     43   0.00%   100.00% 0.00%  
 3 A       2013     42   0.00%   0.00%   100.00%
 4 A       2014     36   0.00%   0.00%   100.00%
 5 A       2015     49   0.00%   100.00% 0.00%  
 6 B       2012     36   0.00%   0.00%   100.00%
 7 B       2013     42   50.00%  0.00%   50.00% 
 8 B       2014     45   0.00%   0.00%   100.00%
 9 C       2011     41   100.00% 0.00%   0.00%  
10 C       2012     39   0.00%   50.00%  50.00% 
11 C       2013     42   0.00%   50.00%  50.00% 
12 C       2014     34.5 0.00%   50.00%  50.00% 
13 C       2015     33   0.00%   0.00%   100.00%
14 D       2014     36   0.00%   0.00%   100.00%

We can shorten the first step by replacing pivot_wider with a dummyfying function from another package. Here I used fastDummies:

library(fastDummies)
library(dplyr)

df %>%
    dummy_cols('race', remove_selected_columns = TRUE)%>%
    group_by(county, year)%>%
    summarise(mean_age = mean(age),
              across(starts_with('race_'),
                     ~scales::percent(mean(.x), 0.01)))

Upvotes: 1

Related Questions