user3571389
user3571389

Reputation: 335

Computing the positive response rate for survey dataset using dplyr

I have a large survey dataset which looks as follows:

          trust09 q16a q16b q16c q16f q16g q23e
1         5A3    3    3    3    4    3    3
2         5A3    2    2    2    2    3    2
3         5A3    4    4    4    5    5    5
4         5A3    3    3    2    4    4    3
5         5A3   NA   NA   NA   NA   NA   NA
6         5A3    4    4    4    4    4    3
....
....
159524     TAN    2    2    3    4    4    3
159525     TAN    4    3    2    1    3    3
159526     TAN    4    4    4    4    4    4
159527     TAN    4   NA    4    2    3    4
159528     TAN    4    4    4    4    4    4
159529     TAN    4    4    4    5    4    5

trust09 is the code for the hospital or organisation and the remaining columns are survey questions from strongly disagree to strongly agree and are scored from 1 to 5 respectively.

Each row corresponds a response from a respondent belonging to some hospital.

From this data, I want to calculate the positive response rate or PRR for each survey question for each hospital i.e. the number of respondents that have answered 'Agree' (4) or 'Strongly Agree' (5) and express this is as a percentage over the total no. of respondents.

I can get the total no. of respondents quite easily from the following code:

df0 <- nss08 %>% select(trust09, q16a, q16b, q16c, q16f, q16g, q23e) %>% 
    group_by(trust09) %>%
    summarise_all(funs(length(.)))

Which gives me the following table:

  trust09  q16a  q16b  q16c  q16f  q16g  q23e
<chr> <int> <int> <int> <int> <int> <int>
1     5A3   414   414   414   414   414   414
2     5A4   298   298   298   298   298   298
3     5A5   271   271   271   271   271   271
4     5A7   384   384   384   384   384   384
5     5A8   343   343   343   343   343   343
6     5A9   502   502   502   502   502   502

I can easily count the number of responses for 'Agree'(4) and 'Strongly Agree' (5) for a single survey question using the following code:

df1 <- nss08 %>%
  select(trust09, q16a) %>%
  group_by(trust09) %>%
  filter(q16a == 4|q16a == 5) %>%
  summarise_all(funs(length(.)))

which gives this sample data:

  trust09  q16a
<chr> <int>
1     5A3   124
2     5A4    65
3     5A5   107
4     5A7   142
5     5A8   126
6     5A9   159
....

I also get the same result using:

aggregate(q16a ~ trust09, data = nss08[nss08$q16a == 4|nss08$q16a == 5, ], length)

I then simply merge these two data and calculate the PRR for the variable/question q16a i.e. no. of respondents who answered 'Agree' (4) or 'Strongly Agree' (5) for this question, dividend by total responses for the question and then multiplied by 100.

The problem occurs when I wish to do the same for all the remaining variables simultaneously rather than simply writing one code corresponding to one single variable.

I have tried the following, but I get an error message:

myList <- vector("list", length = length(myVars))

for (x in seq_along(myVars)){
myList[x] <- aggregate(myVars[x] ~ trust09, data = nss08[nss08$myVars[,x] == 4|nss08$myVars[,x] == 5, ], length)}

I have also tried this without any success:

for (x in seq_along(myVars)){
 myList[[x]] <- nss08 %>%
select(trust09, myVars[x]) %>%
group_by(trust09) %>%
filter(myVars[x] == 4|myVars[x] == 5) %>%
summarise(length(myVars[x]))
}

Maybe, you can see from the code what I am trying to do here.

I wanted to know how can you do the whole process more efficiently by using less code and eventually create a data frame that consists of positive response rates for each the variables/survey questions?

Thank you.

Upvotes: 1

Views: 441

Answers (2)

BENY
BENY

Reputation: 323226

Base on your dplyr code , I made this function , you can call it in the for loop or within apply

xx=function(arg){
  var=quo(arg)
  #print(var)
  df1=df %>%
    select(trust09, !!!quos(arg)) %>%
    group_by(trust09) %>%
    filter_(.dots=paste0(arg,'==','4|',arg,'== 5'))%>%
    summarise(length(!!var))
  return(df1)
}



xx('q16a')

<quosure: frame>
~arg
# A tibble: 2 x 2
  trust09 `length(arg)`
    <chr>         <int>
1     5A3             1
2     TAN             1

Upvotes: 2

akuiper
akuiper

Reputation: 214937

Assume your data frame contains trust09 and all other columns correspond to questions you want to summarize, you can use summarize_all and count the number of 4 and 5 responses with sum(col %in% 4:5, na.rm=TRUE) and divide it by length(col) directly:

df %>% group_by(trust09) %>% summarise_all(~ sum(. %in% 4:5, na.rm = T)/length(.))
# here . refers to all other columns individually except the group variable

# A tibble: 2 x 7
#  trust09      q16a      q16b      q16c      q16f      q16g      q23e
#   <fctr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#1     5A3 0.3333333 0.3333333 0.3333333 0.6666667 0.5000000 0.1666667
#2     TAN 0.8333333 0.5000000 0.6666667 0.6666667 0.6666667 0.6666667

Data used as following:

dput(df)
structure(list(trust09 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("5A3", "TAN"), class = "factor"), 
    q16a = c(3L, 2L, 4L, 3L, NA, 4L, 2L, 4L, 4L, 4L, 4L, 4L), 
    q16b = c(3L, 2L, 4L, 3L, NA, 4L, 2L, 3L, 4L, NA, 4L, 4L), 
    q16c = c(3L, 2L, 4L, 2L, NA, 4L, 3L, 2L, 4L, 4L, 4L, 4L), 
    q16f = c(4L, 2L, 5L, 4L, NA, 4L, 4L, 1L, 4L, 2L, 4L, 5L), 
    q16g = c(3L, 3L, 5L, 4L, NA, 4L, 4L, 3L, 4L, 3L, 4L, 4L), 
    q23e = c(3L, 2L, 5L, 3L, NA, 3L, 3L, 3L, 4L, 4L, 4L, 5L)), .Names = c("trust09", 
"q16a", "q16b", "q16c", "q16f", "q16g", "q23e"), class = "data.frame", row.names = c(NA, 
12L))

Upvotes: 1

Related Questions