Reputation: 335
I have a large survey dataset which looks as follows:
trust09 q16a q16b q16c q16f q16g q23e
1 5A3 3 3 3 4 3 3
2 5A3 2 2 2 2 3 2
3 5A3 4 4 4 5 5 5
4 5A3 3 3 2 4 4 3
5 5A3 NA NA NA NA NA NA
6 5A3 4 4 4 4 4 3
....
....
159524 TAN 2 2 3 4 4 3
159525 TAN 4 3 2 1 3 3
159526 TAN 4 4 4 4 4 4
159527 TAN 4 NA 4 2 3 4
159528 TAN 4 4 4 4 4 4
159529 TAN 4 4 4 5 4 5
trust09 is the code for the hospital or organisation and the remaining columns are survey questions from strongly disagree to strongly agree and are scored from 1 to 5 respectively.
Each row corresponds a response from a respondent belonging to some hospital.
From this data, I want to calculate the positive response rate or PRR for each survey question for each hospital i.e. the number of respondents that have answered 'Agree' (4) or 'Strongly Agree' (5) and express this is as a percentage over the total no. of respondents.
I can get the total no. of respondents quite easily from the following code:
df0 <- nss08 %>% select(trust09, q16a, q16b, q16c, q16f, q16g, q23e) %>%
group_by(trust09) %>%
summarise_all(funs(length(.)))
Which gives me the following table:
trust09 q16a q16b q16c q16f q16g q23e
<chr> <int> <int> <int> <int> <int> <int>
1 5A3 414 414 414 414 414 414
2 5A4 298 298 298 298 298 298
3 5A5 271 271 271 271 271 271
4 5A7 384 384 384 384 384 384
5 5A8 343 343 343 343 343 343
6 5A9 502 502 502 502 502 502
I can easily count the number of responses for 'Agree'(4) and 'Strongly Agree' (5) for a single survey question using the following code:
df1 <- nss08 %>%
select(trust09, q16a) %>%
group_by(trust09) %>%
filter(q16a == 4|q16a == 5) %>%
summarise_all(funs(length(.)))
which gives this sample data:
trust09 q16a
<chr> <int>
1 5A3 124
2 5A4 65
3 5A5 107
4 5A7 142
5 5A8 126
6 5A9 159
....
I also get the same result using:
aggregate(q16a ~ trust09, data = nss08[nss08$q16a == 4|nss08$q16a == 5, ], length)
I then simply merge these two data and calculate the PRR for the variable/question q16a i.e. no. of respondents who answered 'Agree' (4) or 'Strongly Agree' (5) for this question, dividend by total responses for the question and then multiplied by 100.
The problem occurs when I wish to do the same for all the remaining variables simultaneously rather than simply writing one code corresponding to one single variable.
I have tried the following, but I get an error message:
myList <- vector("list", length = length(myVars))
for (x in seq_along(myVars)){
myList[x] <- aggregate(myVars[x] ~ trust09, data = nss08[nss08$myVars[,x] == 4|nss08$myVars[,x] == 5, ], length)}
I have also tried this without any success:
for (x in seq_along(myVars)){
myList[[x]] <- nss08 %>%
select(trust09, myVars[x]) %>%
group_by(trust09) %>%
filter(myVars[x] == 4|myVars[x] == 5) %>%
summarise(length(myVars[x]))
}
Maybe, you can see from the code what I am trying to do here.
I wanted to know how can you do the whole process more efficiently by using less code and eventually create a data frame that consists of positive response rates for each the variables/survey questions?
Thank you.
Upvotes: 1
Views: 441
Reputation: 323226
Base on your dplyr
code , I made this function , you can call it in the for
loop
or within apply
xx=function(arg){
var=quo(arg)
#print(var)
df1=df %>%
select(trust09, !!!quos(arg)) %>%
group_by(trust09) %>%
filter_(.dots=paste0(arg,'==','4|',arg,'== 5'))%>%
summarise(length(!!var))
return(df1)
}
xx('q16a')
<quosure: frame>
~arg
# A tibble: 2 x 2
trust09 `length(arg)`
<chr> <int>
1 5A3 1
2 TAN 1
Upvotes: 2
Reputation: 214937
Assume your data frame contains trust09
and all other columns correspond to questions you want to summarize, you can use summarize_all
and count the number of 4
and 5
responses with sum(col %in% 4:5, na.rm=TRUE)
and divide it by length(col)
directly:
df %>% group_by(trust09) %>% summarise_all(~ sum(. %in% 4:5, na.rm = T)/length(.))
# here . refers to all other columns individually except the group variable
# A tibble: 2 x 7
# trust09 q16a q16b q16c q16f q16g q23e
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 5A3 0.3333333 0.3333333 0.3333333 0.6666667 0.5000000 0.1666667
#2 TAN 0.8333333 0.5000000 0.6666667 0.6666667 0.6666667 0.6666667
Data used as following:
dput(df)
structure(list(trust09 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("5A3", "TAN"), class = "factor"),
q16a = c(3L, 2L, 4L, 3L, NA, 4L, 2L, 4L, 4L, 4L, 4L, 4L),
q16b = c(3L, 2L, 4L, 3L, NA, 4L, 2L, 3L, 4L, NA, 4L, 4L),
q16c = c(3L, 2L, 4L, 2L, NA, 4L, 3L, 2L, 4L, 4L, 4L, 4L),
q16f = c(4L, 2L, 5L, 4L, NA, 4L, 4L, 1L, 4L, 2L, 4L, 5L),
q16g = c(3L, 3L, 5L, 4L, NA, 4L, 4L, 3L, 4L, 3L, 4L, 4L),
q23e = c(3L, 2L, 5L, 3L, NA, 3L, 3L, 3L, 4L, 4L, 4L, 5L)), .Names = c("trust09",
"q16a", "q16b", "q16c", "q16f", "q16g", "q23e"), class = "data.frame", row.names = c(NA,
12L))
Upvotes: 1