Martingales

Reputation: 189

How to combine dplyr group_by, summarise, across and multiple function outputs?

I have the following tibble:

tTest = tibble(Cells = rep(c("C1", "C2", "C3"), times = 3), 
               Gene = rep(c("G1", "G2", "G3"), each = 3), 
               Experiment_score = 1:9, 
               Pattern1 = 1:9, 
               Pattern2 = -(1:9), 
               Pattern3 = 9:1) %>%
        group_by(Gene)

and I would like to correlate Experiment_score with each of the Pattern columns, separately within each Gene group.

Looking at the tidyverse across page and examples, I thought this would work:

# `corList` is a simple wrapper for `cor.test` that returns exactly two outputs:
corList = function(x, y) {
    result = cor.test(x, y)
    return(list(stat = result$estimate, pval = result$p.value))
}

tTest %>% summarise(across(starts_with("Pattern"), ~ corList(Experiment_score, .x), .names = "{.col}_corr_{.fn}"))

but it did not give the result I expected (the screenshot of the output was not preserved).

I have found a solution by melting (pivoting longer) the Pattern columns, and I post it below for completeness. The problem is that I have dozens of Pattern columns and millions of rows: if I melt the Pattern columns, I end up with half a billion rows, which seriously hampers my ability to work with the data.

EDIT: My own imperfect solution:

# `corVect` is a simple wrapper for `cor.test` that returns exactly two outputs:
corVect = function(x, y) {
    result = cor.test(x, y)
    return(c(stat = result$estimate, pval = result$p.value))
}

tTest %>% pivot_longer(starts_with("Pattern"), names_to = "Pattern", values_to = "Strength") %>%
      group_by(Gene, Pattern) %>%
      summarise(CorrVal = corVect(Experiment_score, Strength)) %>% 
      mutate(CorrType = c("corr", "corr_pval")) %>%
      # Reformat
      pivot_wider(id_cols = c(Gene, Pattern), names_from = CorrType, values_from = CorrVal)

Upvotes: 1

Views: 279

Answers (1)

Andy Baxter

Reputation: 7626

To get the desired result in one step, have the function return a tibble rather than a list, and set .unpack = TRUE in across (this argument requires dplyr 1.1.0 or later). Here using a conveniently named corTibble function:

library(tidyverse)

tTest = tibble(
  Cells = rep(c("C1", "C2", "C3"), times = 3),
  Gene = rep(c("G1", "G2", "G3"), each = 3),
  Experiment_score = 1:9,
  Pattern1 = 1:9 + rnorm(9),  # added some noise
  Pattern2 = -(1:9 + rnorm(9)),
  Pattern3 = 9:1 + rnorm(9)
) %>%
  group_by(Gene)

corTibble = function(x, y) {
  result = cor.test(x, y)
  return(tibble(stat = result$estimate, pval = result$p.value))
}

tTest %>% summarise(across(
  starts_with("Pattern"),
  ~ corTibble(Experiment_score, .x),
  .names = "{.col}_corr",
  .unpack = TRUE
))

#> # A tibble: 3 × 7
#>   Gene  Pattern1_corr_stat Pattern1_corr_pval Pattern2…¹ Patte…² Patte…³ Patte…⁴
#>   <chr>              <dbl>              <dbl>      <dbl>   <dbl>   <dbl>   <dbl>
#> 1 G1                 0.947             0.208      -0.991  0.0866  -1.00   0.0187
#> 2 G2                 0.964             0.172      -0.872  0.325   -0.981  0.126 
#> 3 G3                 0.995             0.0668     -0.680  0.524   -0.409  0.732 
#> # … with abbreviated variable names ¹​Pattern2_corr_stat, ²​Pattern2_corr_pval,
#> #   ³​Pattern3_corr_stat, ⁴​Pattern3_corr_pval
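
If a tidy layout with one row per Gene and Pattern is wanted afterwards (as in the pivot_wider result in the question), the small summarised table can be reshaped cheaply, since it has only one row per group. The following is a minimal sketch rather than a definitive recipe. It assumes the column names produced above (Pattern1_corr_stat and so on); the seed, the unname() call and the names wide and long are my own additions for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # arbitrary seed (assumption), only so the sketch is reproducible

tTest <- tibble(
  Gene = rep(c("G1", "G2", "G3"), each = 3),
  Experiment_score = 1:9,
  Pattern1 = 1:9 + rnorm(9),
  Pattern2 = -(1:9 + rnorm(9)),
  Pattern3 = 9:1 + rnorm(9)
)

# Same idea as corTibble above; unname() drops the "cor" name
# that cor.test() attaches to the estimate.
corTibble <- function(x, y) {
  result <- cor.test(x, y)
  tibble(stat = unname(result$estimate), pval = result$p.value)
}

# One row per Gene, two columns (stat, pval) per Pattern column.
wide <- tTest %>%
  group_by(Gene) %>%
  summarise(across(
    starts_with("Pattern"),
    ~ corTibble(Experiment_score, .x),
    .names = "{.col}_corr",
    .unpack = TRUE
  ))

# Split names such as "Pattern1_corr_stat" into the pattern name and
# the statistic; ".value" turns stat and pval into their own columns.
long <- wide %>%
  pivot_longer(
    -Gene,
    names_to = c("Pattern", ".value"),
    names_pattern = "(.*)_corr_(.*)"
  )

long
```

Because the pivot happens after summarise, it works on groups × patterns rows rather than on the full data, so it should avoid the row blow-up described in the question.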

Upvotes: 1
