How to perform operation between grouped values from two data frames

Question

I have two data frames:

src_tbl <- structure(list(Sample_name = c("S1", "S2", "S1", "S2", "S1", 
"S2"), crt = c(0.079, 0.082, 0.079, 0.082, 0.079, 0.082), sr = c(0.592, 
0.549, 0.592, 0.549, 0.592, 0.549), condition = c("x1", "x1", 
"x2", "x2", "x3", "x3"), score = c("0.077", "0.075", "0.483", 
"0.268", "0.555", "0.120")), row.names = c(NA, -6L), .Names = c("Sample_name", 
"crt", "sr", "condition", "score"), class = c("tbl_df", 
"tbl", "data.frame"))
src_tbl
#>   Sample_name   crt    sr condition score
#> 1          S1 0.079 0.592        x1 0.077
#> 2          S2 0.082 0.549        x1 0.075
#> 3          S1 0.079 0.592        x2 0.483
#> 4          S2 0.082 0.549        x2 0.268
#> 5          S1 0.079 0.592        x3 0.555
#> 6          S2 0.082 0.549        x3 0.120

ref_tbl <- structure(list(Sample_name = c("P1", "P2", "P3", "P1", "P2", 
"P3", "P1", "P2", "P3"), crt = c(1, 1, 1, 1, 1, 1, 1, 1, 1), 
    sr = c(2, 2, 2, 2, 2, 2, 2, 2, 2), condition = c("r1", "r1", 
    "r1", "r2", "r2", "r2", "r3", "r3", "r3"), score = c("0.200", 
    "0.201", "0.199", "0.200", "0.202", "0.200", "0.200", "0.204", 
    "0.197")), row.names = c(NA, -9L), .Names = c("Sample_name", 
"crt", "sr", "condition", "score"), class = c("tbl_df", 
"tbl", "data.frame"))
ref_tbl
#>   Sample_name crt sr condition score
#> 1          P1   1  2        r1 0.200
#> 2          P2   1  2        r1 0.201
#> 3          P3   1  2        r1 0.199
#> 4          P1   1  2        r2 0.200
#> 5          P2   1  2        r2 0.202
#> 6          P3   1  2        r2 0.200
#> 7          P1   1  2        r3 0.200
#> 8          P2   1  2        r3 0.204
#> 9          P3   1  2        r3 0.197

What I want to do is to perform operation (ks.test()) on score columns grouped by Sample_name in both data frame. For example the p-value of KS test for S1 and P1 is:

# in src_tbl
s1 <- c(0.077,0.483,0.555)
#in ref_tbl
p1 <- c(0.200,0.200,0.200)
testout <- ks.test(s1,p1)
#> Warning in ks.test(s1, p1): cannot compute exact p-value with ties
broom::tidy(testout)
#>   statistic   p.value                             method alternative
#> 1 0.6666667 0.5175508 Two-sample Kolmogorov-Smirnov test   two-sided

What I want to do is to perform all against all operation so that in the end, we get table like this

src  ref   p.value
S1   P1    0.5175508
S1   P2    0.6
S1   P3    0.6
S2   P1    0.5175508
S2   P2    0.6
S2   P3    0.6

How can I do that? Preferrable to be fast because the number of samples in ref_table could be large (P1, P2 .... P10k).

FlorianGD · Accepted Answer

Here is a solution in the tidyverse. I first nest the score in each source dataset :

ref_tbl <- ref_tbl %>% 
  mutate(ref = as.factor(Sample_name),
         score_ref = as.numeric(score)) %>%
  select(ref, score_ref) %>%
  tidyr::nest(score_ref)

ref_tbl
# A tibble: 3 x 2
     ref                    data
                    
1     P1 
2     P2 
3     P3 

src_tbl <- src_tbl %>% 
  mutate(src = as.factor(Sample_name),
         score_src = as.numeric(score))  %>% 
  select(src, score_src) %>% 
  tidyr::nest(score_src)

src_tbl  
# A tibble: 2 x 2
     src                    data
                    
1     S1 
2     S2

Then I create a grid with all the combinations of sample names :

all_comb <- as_data_frame(expand.grid(src = src_tbl$src, ref = ref_tbl$ref))

all_comb
# A tibble: 6 x 2
     src    ref
   
1     S1     P1
2     S2     P1
3     S1     P2
4     S2     P2
5     S1     P3
6     S2     P3

Now, we can join with the nested data, and I bind the columns so have to have a single list column with the scores, for each combination.

all_comb <- all_comb %>% 
  left_join(ref_tbl, by = "ref") %>% 
  left_join(src_tbl, by = "src") %>%
  mutate(data = purrr::map2(data.x, data.y, bind_cols)) %>%
  select(-data.x, -data.y)

all_comb 
# A tibble: 6 x 3
     src    ref                    data
                     
1     S1     P1 
2     S2     P1 
3     S1     P2 
4     S2     P2 
5     S1     P3 
6     S2     P3

Finally, I map ks.test if each data set, use broom to get the p.value as requested.

final <- all_comb %>%
  mutate(ks = purrr::map(data,  ~ks.test(.$score_ref, .$score_src)),
  tidied = purrr::map(ks, broom::tidy)) %>%
  tidyr::unnest(tidied) %>%
  select(src, ref, p.value)
Warning message: cannot compute exact p-value with ties
Warning message: cannot compute exact p-value with ties

final
# A tibble: 6 x 3
     src    ref   p.value
        
1     S1     P1 0.5175508
2     S2     P1 0.5175508
3     S1     P2 0.6000000
4     S2     P2 0.6000000
5     S1     P3 0.6000000
6     S2     P3 0.6000000

How to perform operation between grouped values from two data frames

Answers (2)

Related Questions