DTYK

Reputation: 1200

Replace largest value in a row with a specific number and replace all other values in that same row based on that largest value using dplyr

I have a data frame consisting of questionnaire data, with each column representing one item on the questionnaire.

The data looks something like the following:

df <- data.frame(Q1a = c(3, 2, 5, 6, 9), Q1b = c(2, 0, -2, 0, 9), Q2a = c(1, 4, 7, 2, 4), 
             Q2b = c(0, 0, -1, 0, 0), Q3a = c(5, 7, 2, 0, 9), Q3b = c(-2, -2, 3, 6, 9),
             Q4a = c(5, 2, 4, 9, 0), Q4b = c(0, 0, -2, -2, -2))

Each questionnaire item has two versions (a and b). I want to select all the items with a "b" suffix and, for each of those columns, divide the number of cells with a value of -2 by the total number of non-blank, non-NA cells in that column. For example, in Q1b one of the five non-NA cells equals -2, so the result should be 1/5 * 100 = 20. I want to repeat this procedure for every "b" column, and I managed to accomplish it with the following code:

test <- df %>%
  select(ends_with("b")) %>%
  mutate_all(funs(round(sum(. == -2)/sum(. != "" | . != NA)*100, 
                        digits = 2)))

As far as I know there is no "group_by" equivalent that works on columns instead of rows, so the same result is repeated in every row of the output above. I managed to remove the repeated rows with slice(), using the following code:

test <- df %>%
  select(ends_with("b")) %>%
  mutate_all(funs(round(sum(. == -2)/sum(. != "" | . != NA)*100, 
                        digits = 2))) %>%
  slice(1) 

With the above output, I want to continue my dplyr pipe by replacing the largest value in that row with the value 1 and expressing all other values as a percentage of that largest value.

What I have is the following:

[screenshot: Before]

The output I want is:

[screenshot: After]
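To spell out the transformation I mean, here is a minimal base-R sketch, assuming the one-row result of the pipe above works out to 20, 0, 40 and 60 for Q1b to Q4b:

# Illustration only, not part of my actual pipe: assumed intermediate row.
row_vals <- c(Q1b = 20, Q2b = 0, Q3b = 40, Q4b = 60)

# Dividing by the row maximum turns the largest value into 1 and the others
# into fractions of it (multiply by 100 for percentages).
row_vals / max(row_vals)
# roughly 0.33, 0, 0.67 and 1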

My two questions are:

1). Is there a group_by equivalent that works on columns? If so, I wouldn't have to use slice in such a clumsy manner.

2). Could someone help me complete my dplyr pipe to the output that I want? I am not sure how to proceed from there.

Thanks!

Upvotes: 1

Views: 506

Answers (2)

MKR

Reputation: 20085

It's always preferable to work on data in long format. But if the data.frame has 3000+ columns and the OP wants to work on the data in wide format itself, one solution is to use dplyr::summarise_all instead of the mutate_all used by the OP:

library(dplyr)

df %>%
  select(ends_with("b")) %>%
  summarise_all(funs(round(sum(. == -2)/sum(!is.na(.))*100, 
                        digits = 2))) %>%
  mutate(maxV = max(.)) %>%
  mutate_all(funs(100*./maxV)) %>%
  select(-maxV)

#     Q1b Q2b   Q3b Q4b
# 1 33.33   0 66.67 100
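As a side note, funs() has since been deprecated in dplyr; with dplyr 1.0 or later the same approach could be sketched with across() (the intermediate name pct below is mine, not part of the original answer):

library(dplyr)

# Per-column percentage of -2 values among the non-NA cells, as above.
pct <- df %>%
  select(ends_with("b")) %>%
  summarise(across(everything(),
                   ~ round(sum(.x == -2) / sum(!is.na(.x)) * 100, digits = 2)))

# Rescale so the largest column becomes 100 and the rest are percentages of it.
pct %>%
  mutate(across(everything(), ~ 100 * .x / max(unlist(pct))))
# same values as the output above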

Upvotes: 1

Wimpel

Reputation: 27732

First, gather your data into long format, then filter the NA and 0 values. Then you can group and summarise.

library(tidyverse)
df %>%
  select(ends_with("b")) %>%
  gather() %>%
  filter( !value == 0 | is.na( value ) ) %>%
  group_by( key ) %>%
  summarise( output = round( 100 * length( value[value == -2] ) / n(), digits = 1 ) )

# # A tibble: 4 x 2
# key   output
# <chr>  <dbl>
# 1 Q1b     33.3
# 2 Q2b      0  
# 3 Q3b     40  
# 4 Q4b    100

Please note that the outcome of Q3b differs from your desired output. Are you sure the desired output is correct?
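As an aside, gather() has since been superseded by tidyr's pivot_longer(); with a newer tidyverse the same pipeline could be sketched as follows (the names_to/values_to arguments simply mirror the key/value columns above):

library(tidyverse)

df %>%
  select(ends_with("b")) %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  filter(value != 0 | is.na(value)) %>%
  group_by(key) %>%
  summarise(output = round(100 * length(value[value == -2]) / n(), digits = 1))
# same tibble as shown above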

Upvotes: 3
