Reputation: 11
I would like to generate a table out of a dataset. Of the variables to be displayed two are strings and the other three are numeric. The numeric variables contain information in absolute numbers. Next to the display of these absolute numbers in the table (sum of students starting at different universities), I want to display the respective shares of women as percentages. The dataset contains the sums as well as the shares of women in total numbers as separate variables (in columns).
Grouped by x1 and x2, the table should contain columns containing total sums (x3a, x4a, x5a) and shares in percent of female students (x3b, x4b, x5b).
As the dataset contains the total sums in a variable, I think I just need to add the respective variable into the code somewhere after some sort of grouping functions, but the shares still need to be calculated and then printed into a new variable/column and I seem to can't figure out a code for the whole table. I know there should be some group_by, summarise and mutate functions and the strings have to be factorised to get the code running, but I haven't found a solution yet.
Any help would be greatly appreciated!
That's how it should look like:
library(dplyr)
df=data.frame(
x1=c("chr","chr","chr","chr","chr","chr","chr","chr","chr","chr"),
x2=c("chr","chr","chr","chr","chr","chr","chr","chr","chr","chr"),
x3=c(1,0,0,NA,0,1,1,NA,0,1), #x3=year of university start
#x3a=containing total number (of students starting university)
#x3b=containing percentage of female students, calculated on x3a
x4=c(1,1,NA,1,1,0,NA,NA,0,1),#x4=year of university start
#x4a
#x4b
x5=c(1,0,NA,1,0,0,NA,0,0,1)) #x5=year of university start
#x5a
#x5b
> df
I tried summarise and mutate functions but didn't manage to get the described table with correctly calculated shares.
Upvotes: 1
Views: 69
Reputation: 101
I'm not sure I understood the question correctly because your data example is overly generic. I hope that my answer is related to what you want and could be useful!
First, I reviewed the example dataset, considering that one of the character strings represents the names of the Universities and the other one a qualitative variable referencing women. Here is the dput
:
df <- structure(list(W_or_not = c("chr", "W", "chr", "W", "chr", "W",
"chr", "chr", "chr", "chr"), Univ = c("univ1", "univ1", "univ1",
"univ1", "univ1", "univ2", "univ2", "univ2", "univ2", "univ2"
), year1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1), year2 = c(1, 1,
NA, 1, 1, 0, NA, NA, 0, 1), year3 = c(1, 0, NA, 1, 0, 0, NA,
0, 0, 1)), class = "data.frame", row.names = c(NA, -10L))
head(df)
W_or_not Univ year1 year2 year3
1 chr univ1 1 1 1
2 W univ1 0 1 0
3 chr univ1 0 NA NA
4 W univ1 NA 1 1
5 chr univ1 0 1 0
6 W univ2 1 0 0
7 chr univ2 1 NA NA
8 chr univ2 NA NA 0
9 chr univ2 0 0 0
10 chr univ2 1 1 1
I suggest that you compute the sum of students per university (first table), then the sum of women per university (second table). Finally, you can update your first table by computing the percentage of women per university, as follow:
### First table
df_sum_univ <- df %>%
group_by(Univ) %>%
summarise_if(is.numeric, .funs = ~ sum(.x, na.rm = TRUE)) %>%
rename_if(is.numeric, ~paste0('sum_', .))
### Second one
df_sum_univ_F <- df %>%
filter(W_or_not == 'W') %>%
group_by(Univ) %>%
summarise_if(is.numeric, .funs = ~ sum(.x, na.rm = TRUE)) %>%
rename_if(is.numeric, ~paste0('perc_w_', .))
### Updating the first one in a new table
df_final <- df_sum_univ %>%
bind_cols(df_sum_univ_F[,2:4]/df_sum_univ[,2:4]*100)
head(df_final)
# A tibble: 2 × 7
Univ sum_year1 sum_year2 sum_year3 perc_w_year1 perc_w_year2 perc_w_year3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 univ1 1 4 2 0 50 50
2 univ2 3 1 1 33.3 0 0
Upvotes: 0