Reputation: 1200
I have a data frame consisting of questionnaire data, with each column representing one item on the questionnaire.
The data looks something like the following:
df <- data.frame(Q1a = c(3, 2, 5, 6, 9), Q1b = c(2, 0, -2, 0, 9), Q2a = c(1, 4, 7, 2, 4),
Q2b = c(0, 0, -1, 0, 0), Q3a = c(5, 7, 2, 0, 9), Q3b = c(-2, -2, 3, 6, 9),
Q4a = c(5, 2, 4, 9, 0), Q4b = c(0, 0, -2, -2, -2))
Each questionnaire item has two versions (a, b). I want to select all the items that possess a "b" suffix. For those columns with a "b" suffix, I want to divide the number of cells with a value of -2 against the total number of non-blank and non-NA in that particular column. I want to repeat the above procedure for all columns. I managed to accomplish it with the following code:
test <- df %>%
select(ends_with("b")) %>%
mutate_all(funs(round(sum(. == -2)/sum(. != "" | . != NA)*100,
digits = 2)))
As there is no "group_by" equivalent that I know of that works on columns instead of rows, the same results are repeated in every row of the above output. I have managed to remove the rows with the repeated information with slice
, using the following code:
test <- df %>%
select(ends_with("b")) %>%
mutate_all(funs(round(sum(. == -2)/sum(. != "" | . != NA)*100,
digits = 2))) %>%
slice(1)
With the above output, I want to continue my dplyr pipe by replacing the largest value in that row with the value of 1, and all other values as a percentage of the largest value.
What I have is the following:
The output I want is:
My two questions are:
1). Is there a group_by equivalent that works on columns? If so, I wouldn't have to use slice
in such a clumsy manner.
2). Could someone help me complete my dplyr pipe to the output that I want? I am not sure how to proceed from there.
Thanks!
Upvotes: 1
Views: 506
Reputation: 20085
It's always preferable to work on data in long format. But if 3000+
columns are there in data.frame and OP wants work on data in wide format itself then one solution can be to use dplyr::summarise_all
instead of mutate_all
as used by OP:
library(dplyr)
df %>%
select(ends_with("b")) %>%
summarise_all(funs(round(sum(. == -2)/sum(!is.na(.))*100,
digits = 2))) %>%
mutate(maxV = max(.)) %>%
mutate_all(funs(100*./maxV)) %>%
select(-maxV)
# Q1b Q2b Q3b Q4b
# 1 33.33 0 66.67 100
Upvotes: 1
Reputation: 27732
first, gather your data in long format, then filter NA and 0-values. Than you can group and summarise.
library(tidyverse)
df %>%
select(ends_with("b")) %>%
gather() %>%
filter( !value == 0 | is.na( value ) ) %>%
group_by( key ) %>%
summarise( output = round( 100 * length( value[value == -2] ) / n(), digits = 1 ) )
# # A tibble: 4 x 2
# key output
# <chr> <dbl>
# 1 Q1b 33.3
# 2 Q2b 0
# 3 Q3b 40
# 4 Q4b 100
Please note that the outcome of Q3b differs from your desired output. Are you sure the desired output is correct?
Upvotes: 3