jaydoc


R - maximum value of variables when compared between levels of variable1 grouped by variable2

Consider the following data

set.seed(123)

example.df <- data.frame(
  gene    = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
  treated = sample(c("Yes", "No"), 100, replace = TRUE),
  resp    = rnorm(100, 10, 5),
  effect  = rnorm(100, 25, 5)
)

I am trying to get the maximum value of every numeric variable for each pairwise comparison of the levels of gene, grouped by treated. I can create the gene combinations like so:

combn(sort(unique(example.df$gene)), 2, simplify = TRUE)

#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] A    A    A    B    B    C   
#[2,] B    C    D    C    D    D   
#Levels: A B C D

Edit: The output I am looking for is a data frame like this:

comparison   group    max.resp    max.effect
A-B          no       value1      value2
....
C-D          no       valueX      valueY
A-B          yes      value3      value4 
.... 
C-D          yes      valueXX     valueYY

I am able to get the max values for each individual gene level, grouped by treated:

library(tidyverse) # provides %>%, group_by, nest, map, unnest

max.df <- example.df %>%
  group_by(treated, gene) %>%
  nest() %>%
  mutate(mod = map(data, ~summarise_if(.x, is.numeric, max, na.rm = TRUE))) %>%
  select(treated, gene, mod) %>%
  unnest(mod) %>%
  arrange(treated, gene)

However, despite working on this for more than a day, I cannot figure out how to get the max of each numeric variable for each two-level gene comparison (A vs B, A vs C, A vs D, B vs C, B vs D, and C vs D), grouped by treated.

Any help is appreciated. Thanks.


Answers (1)

Derek Corcoran


I found a solution. It is a little messy and I may tidy it up later, but it runs almost instantly.

library(tidyverse)

First I generate a data frame with two columns, Gen1 and Gen2, holding all possible comparisons. This is very similar to your use of combn, but it creates a data.frame.

GeneComp <- expand.grid(Gen1 = unique(example.df$gene),
                        Gen2 = unique(example.df$gene)) %>%
  filter(Gen1 != Gen2) %>%
  arrange(Gen1)
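
As a side note, expand.grid() returns ordered pairs, so every comparison shows up in both directions, whereas combn() returns each unordered pair once. Here is a minimal base-R sketch of the difference. The hard-coded `genes` vector and the name `ordered_pairs` are only stand-ins for illustration, not part of the code above:

```r
# Stand-in for unique(example.df$gene); assumed here so the snippet runs on its own
genes <- c("A", "B", "C", "D")

# All ordered pairs, minus the self-pairs (A-A, B-B, ...)
ordered_pairs <- expand.grid(Gen1 = genes, Gen2 = genes, stringsAsFactors = FALSE)
ordered_pairs <- ordered_pairs[ordered_pairs$Gen1 != ordered_pairs$Gen2, ]

nrow(ordered_pairs)    # 12: both A-B and B-A are present
ncol(combn(genes, 2))  # 6: each unordered pair appears once
```

So a loop over the expand.grid() result visits every comparison twice.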

Then I loop through GeneComp, grouping by treated:

Comps <- list()
for (i in seq_len(nrow(GeneComp))) {
  Comps[[i]] <- example.df %>%
    filter(gene == GeneComp[i, ]$Gen1 | gene == GeneComp[i, ]$Gen2) %>% # keep only rows whose gene is in the ith pair
    group_by(treated) %>% # then group by treated
    summarise_if(is.numeric, max) %>% # then take the max of each numeric column
    mutate(Comparison = paste(GeneComp[i, ]$Gen1, GeneComp[i, ]$Gen2, sep = "-")) # and label the comparison
}

Comps <- bind_rows(Comps) # finally, bind everything into one data frame

Let me know if it does everything you want.

Edit: to get each comparison only once

It is important here that your genes are strings and not factors, so you might have to do this:

options(stringsAsFactors = FALSE) # only needed on R < 4.0.0, where strings default to factors

example.df <- data.frame(
  gene    = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
  treated = sample(c("Yes", "No"), 100, replace = TRUE),
  resp    = rnorm(100, 10, 5),
  effect  = rnorm(100, 25, 5)
)

Then, in expand.grid, also add the stringsAsFactors = FALSE argument:

GeneComp <- expand.grid(Gen1 = unique(example.df$gene),
                        Gen2 = unique(example.df$gene),
                        stringsAsFactors = FALSE) %>%
  filter(Gen1 != Gen2) %>%
  arrange(Gen1)

Because the gene names are now strings, you can sort the two genes before pasting them into the Comparison variable inside the loop. Each pair is still processed twice (A-B and B-A both get the label "A-B"), but the two resulting rows are identical, so calling distinct() at the end leaves each comparison exactly once.
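
To see why the sorting matters, here is a small base-R sketch. `pair_label` is a hypothetical helper written just for this illustration, not something from the loop itself:

```r
# Hypothetical helper: build a label that does not depend on argument order
pair_label <- function(g1, g2) paste(sort(c(g1, g2)), collapse = "-")

pair_label("A", "B")  # "A-B"
pair_label("B", "A")  # "A-B" as well, because the genes are sorted first

# Three labels built from pairs in mixed order collapse to two distinct ones
labels <- c(pair_label("A", "B"), pair_label("B", "A"), pair_label("C", "A"))
unique(labels)        # "A-B" "A-C"
```

This only works on character vectors. With factors, c() and sort() can behave differently depending on the R version, which is why the genes need to be strings.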

Comps <- list()
for (i in seq_len(nrow(GeneComp))) {
  pair <- sort(c(GeneComp[i, ]$Gen1, GeneComp[i, ]$Gen2)) # sorted so A-B and B-A get the same label
  Comps[[i]] <- example.df %>%
    filter(gene %in% pair) %>% # keep only rows whose gene is in the ith pair
    group_by(treated) %>% # then group by treated
    summarise_if(is.numeric, max) %>% # then take the max of each numeric column
    mutate(Comparison = paste(pair, collapse = "-")) # and label the comparison
}

Comps <- bind_rows(Comps) %>% distinct() # finally, bind into one data frame and drop the duplicated rows
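
If you would rather avoid the duplicates in the first place, one possible alternative is to iterate over combn() directly, since it yields each unordered pair once. The following is only a sketch under a few assumptions: it uses base R so it runs without extra packages, `pair_max` and `Comps2` are names made up for this example, and the output columns are named after the layout requested in the question.

```r
set.seed(123)
example.df <- data.frame(
  gene    = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
  treated = sample(c("Yes", "No"), 100, replace = TRUE),
  resp    = rnorm(100, 10, 5),
  effect  = rnorm(100, 25, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: max of resp and effect per treated group,
# restricted to the rows whose gene is one of the two genes in `pair`
pair_max <- function(pair) {
  sub <- example.df[example.df$gene %in% pair, ]
  out <- aggregate(cbind(resp, effect) ~ treated, data = sub, FUN = max)
  names(out) <- c("group", "max.resp", "max.effect")
  data.frame(comparison = paste(pair, collapse = "-"), out,
             stringsAsFactors = FALSE)
}

genes <- sort(unique(example.df$gene))

# combn() calls pair_max once per unordered pair; simplify = FALSE keeps a list
Comps2 <- do.call(rbind, combn(genes, 2, FUN = pair_max, simplify = FALSE))
Comps2 <- Comps2[order(Comps2$group, Comps2$comparison), ]
```

Because each pair is visited only once, no distinct() step is needed here. The numeric columns are hard-coded in the aggregate() formula, so it would need adjusting for data with other variables.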

