Coding challenge: Calculate sum of col values if rows in other col are TRUE (table summary)

Question

We have a table in which one column contains values (protein counts) and the following columns are logical vectors (TRUE or FALSE, indicating whether the protein id has the property). For each logical column we want the sum of all values where the column is TRUE, and the count of TRUE entries.

The task may be easier to describe with example data:

[Please excuse that the example data require a package for the random id generator; if you know a base R solution, please write a comment and I will include it here.]

library("stringi")

value <- c(sample(2:5, 20 , replace=T),
           sample(6:10, 20 , replace=T), 
           sample(1:7, 20 ,  replace=T), 
           sample(3:10, 20 , replace=T), 
           sample(10:20, 20 , replace=T) )

data <- data.frame(
  id = stringi::stri_rand_strings(20, 5),
  value = value,
  nucleus = sample(c(TRUE,FALSE), 20, TRUE),
  membrane = sample(c(TRUE,FALSE), 20, TRUE),
  mitochondria = sample(c(TRUE,FALSE), 20, TRUE))
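
A base R replacement for the id generator could look like the sketch below. The helper name rand_ids is made up for illustration, and nothing in it guarantees unique ids, although collisions are unlikely for 20 ids of length 5.

```r
# Hypothetical helper (not part of stringi or base R):
# n random alphanumeric strings of length len, using base R only.
rand_ids <- function(n, len = 5) {
  chars <- c(letters, LETTERS, 0:9)  # 62 candidate characters
  vapply(
    seq_len(n),
    function(i) paste(sample(chars, len, replace = TRUE), collapse = ""),
    character(1)
  )
}

set.seed(1)  # only so the example is reproducible
ids <- rand_ids(20, 5)
```

In the data.frame() call above, id = rand_ids(20, 5) would then stand in for stringi::stri_rand_strings(20, 5).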

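For each property column we want the sum of all values and the count of all ids where the column is TRUE. Next, check whether an id is TRUE in multiple columns. If so, add a row whose property is the string of all those column names separated by _, together with the sum of the values. Lastly, add a column with all matching ids separated by ;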

expected_result_1 <- data.frame(
  property = c('nucleus', 'membrane', 'mitochondria', 'nucleus_membrane'),
  value_sum = c('x', 'y', 'z', 'w'),
  n_ids = c(4, 3, 1, 2),
  ids = c("MSATv;1NFZ4;Kftq5;JANXo", "htiFJ;kCHtA8;jXXh", "kCHtA", "MSATv;htiFJ"))

A dplyr solution would be great!

Thank you!

Sebastian

ThomasIsCoding · Accepted Answer

I am not sure whether the code below can give the desired output, but here is a base R attempt.

First, we can define a user function f, which summarizes the information in data for a given set of property columns:

f <- function(cols) {
  idx <- rowSums(data[cols]) == length(cols)
  data.frame(
    property = paste0(cols, collapse = "_"),
    value_sum = sum(data$value[idx],na.rm = TRUE),
    n_ids = length(unique(data$id[idx])),
    ids = toString(unique(data$id[idx]))
  )
}
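
To see what the idx line inside f does, here is a minimal sketch on a made-up three-row table (toy and its values are invented for illustration, not taken from the question's data). A row is kept only if every selected logical column is TRUE, because TRUE counts as 1 in rowSums.

```r
# Made-up table, for illustration only
toy <- data.frame(
  id = c("a", "b", "c"),
  value = c(1, 2, 4),
  nucleus = c(TRUE, TRUE, FALSE),
  membrane = c(TRUE, FALSE, FALSE)
)

cols <- c("nucleus", "membrane")

# Number of TRUEs per row equals the number of selected columns
# exactly when all of them are TRUE
idx <- rowSums(toy[cols]) == length(cols)

kept_ids <- toy$id[idx]           # only "a" is TRUE in both columns
kept_sum <- sum(toy$value[idx])   # 1
```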

Then, we select the columns (v is the vector of selected column names) and apply f to every combination of them:

v <- c("nucleus", "membrane", "mitochondria")
output <- do.call(
  rbind,
  unlist(
    lapply(
      seq_along(v),
      function(k) combn(v, k, FUN = f, simplify = FALSE)
    ),
    recursive = FALSE
  )
)
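
To make the enumeration step visible on its own, the same combn/lapply/unlist pattern can be run without f. This sketch only lists the column combinations and builds their labels; the names combos and labels are introduced here for illustration.

```r
v <- c("nucleus", "membrane", "mitochondria")

# For k = 1, 2, 3 list all k-element subsets of v, then flatten
# the list of lists into one list of character vectors
combos <- unlist(
  lapply(seq_along(v), function(k) combn(v, k, simplify = FALSE)),
  recursive = FALSE
)

length(combos)  # 7, i.e. 2^3 - 1 non-empty subsets

# One label per combination, joined with "_"
labels <- vapply(combos, paste, character(1), collapse = "_")
```

Passing FUN = f to combn, as in the code above, applies f to each of these subsets directly instead of returning the subsets themselves.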

and we will get

> output
                       property value_sum n_ids
1                       nucleus       406    11
2                      membrane       367    10
3                  mitochondria       278     8
4              nucleus_membrane       193     5
5          nucleus_mitochondria       135     4
6         membrane_mitochondria       136     4
7 nucleus_membrane_mitochondria        37     1
                                                                          ids
1 zMknh, TUJhp, QVf8L, P5vps, w4NX6, 2IVbG, AT0RG, SxiO7, ErRUg, 1wIAO, YgefT
2        P5vps, w4NX6, nj3Tv, 2IVbG, xRMA3, eZzb4, ErRUg, l9qwa, SQWq3, YgefT
3                      P5vps, QMw74, eZzb4, AT0RG, SxiO7, l9qwa, 1wIAO, SQWq3
4                                           P5vps, w4NX6, 2IVbG, ErRUg, YgefT
5                                                  P5vps, AT0RG, SxiO7, 1wIAO
6                                                  P5vps, eZzb4, l9qwa, SQWq3
7                                                                       P5vps
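
Note that toString separates the ids with ", ", whereas the question asks for ";". A possible variant, sketched here under the assumption that the table has id and value columns as in the question, takes the data frame as an argument instead of reading the global data and joins the ids with ";". The name summarise_cols and the toy table are made up for illustration.

```r
# Hypothetical variant of f: data frame passed explicitly, ids joined with ";"
summarise_cols <- function(df, cols) {
  idx <- rowSums(df[cols]) == length(cols)  # TRUE where all cols are TRUE
  ids <- unique(df$id[idx])
  data.frame(
    property = paste(cols, collapse = "_"),
    value_sum = sum(df$value[idx], na.rm = TRUE),
    n_ids = length(ids),
    ids = paste(ids, collapse = ";")
  )
}

# Made-up table, for illustration only
toy <- data.frame(
  id = c("a", "b", "c", "d"),
  value = c(1, 2, 4, 8),
  nucleus = c(TRUE, TRUE, FALSE, TRUE),
  membrane = c(TRUE, FALSE, TRUE, TRUE)
)

v <- c("nucleus", "membrane")
combos <- unlist(
  lapply(seq_along(v), function(k) combn(v, k, simplify = FALSE)),
  recursive = FALSE
)

# One summary row per combination of columns
res <- do.call(rbind, lapply(combos, summarise_cols, df = toy))
```

Here res has three rows (nucleus, membrane, nucleus_membrane) with value sums 11, 13 and 9.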
