Aryh
Aryh

Reputation: 595

For loop in R to run Chi_square test on all columns

head(Data)
                       
          Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 
  SAMPLE1   0    1      0      0     0     1   1     1          
  SAMPLE2   0    1      0      1     0     1   0     1        
  SAMPLE3   0    0      0      1     0     0   1     1          
  SAMPLE4   1    0      0      0     1     0   0     0    

I am running fisher test and chi square test like following

Gene7_Gene8<- table (Data[, c ("Gene7", "Gene8")])
chisq.test( Gene7_Gene8) and 
fisher.test(Gene7_Gene8)

How can i run the chisq.test by for loop in all the data sets at once. I am looking for p.value against each possible pair of gene in my data set. (my original data has around 60 columns)

Upvotes: 1

Views: 732

Answers (3)

akrun
akrun

Reputation: 887118

We may use combn

lst1 <- combn(seq_len(ncol(Data)), 2, function(i) 
   FUN = chisq.test(table(Data[, i])), simplify = FALSE)
names(lst1) <- combn(colnames(Data), 2, FUN = paste, collapse = "_")

Upvotes: 3

RYann
RYann

Reputation: 683

I hope this works out for you. enter image description here

# This is a fake dataset similar to the one in the Opening post:
df <- data.frame(
  
  gene1 = sample(c(0,1),10,T),
  gene2 = sample(c(0,1),10,T),
  gene3 = sample(c(0,1),10,T),
  gene4 = sample(c(0,1),10,T),
  gene5 = sample(c(0,1),10,T)
)
rownames(df) <- paste0("sample",1:nrow(df))



# ldf stands for long df, and is a longer version, which will make it easier to analysie.
ldf <- df %>% pivot_longer(cols = everything(),names_to = "gene_type",
                           values_to = "true_false") %>%
  arrange(gene_type)
# ndf stands for nested df, and is the already nested-by-genes data frame we will analyse
ndf <- ldf %>% nest_by(gene_type)

# Here we create a new colum named p_value and in the map() function
# we extract the p values from a chiq.test erformed on each gene
ndf$p_value <- map(ndf$data,~chisq.test(table(.x))$p.value) %>% unlist()


# Now we have our final data frame and can further anlyse and create plots.
final_df <- ndf %>% unnest(cols = data)
final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>% 
  mutate(is_significant=p.value<0.05)

# In my personal opinion, leaving the data frame in a long version is best suited for further anlysis.

final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>% 
  mutate(is_significant=p.value<0.05) %>% 
  ggplot(aes(x=gene_type,y=p.value,color=is_significant))+
  geom_text(aes(label=gene_type),show.legend = F)+
  labs(title = "P values of different genes")+
  theme_minimal()

Upvotes: 3

r2evans
r2evans

Reputation: 160447

No for loop is required, we can do them as such:

com <- data.frame(t(combn(colnames(Data), 2)))
com
#       X1    X2
# 1  Gene1 Gene2
# 2  Gene1 Gene3
# 3  Gene1 Gene4
# 4  Gene1 Gene5
# 5  Gene1 Gene6
# 6  Gene1 Gene7
# 7  Gene1 Gene8
# 8  Gene2 Gene3
# ...

chisq <- apply(com, 1, function(i) chisq.test(table(Data[,i]))$p.value)
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# ...

chisq
#  [1] 1.0000000 0.3173105 1.0000000 0.5049851 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000 0.3173105
# [12] 1.0000000 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000

For the fisher test, we run into some problems:

fisher <- apply(com, 1, function(i) fisher.test(table(Data[,i]))$p.value)
# Error in fisher.test(table(Data[, i])) : 
#   'x' must have at least 2 rows and columns

This is because not all tables are 2x2, so we need to catch that error and return something meaningful. I suggest NA:

fisher <- apply(com, 1, function(i) tryCatch(fisher.test(table(Data[,i]))$p.value, error = function(e) NA_real_))
fisher
#  [1] 1.0000000        NA 1.0000000 0.2500000 1.0000000 1.0000000 0.2500000        NA 1.0000000 1.0000000 0.3333333
# [12] 1.0000000 1.0000000        NA        NA        NA        NA        NA 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.2500000 1.0000000 1.0000000 1.0000000

From here, we can easily combine them:

out <- cbind(com, chisq=chisq, fisher=fisher)
out
#       X1    X2     chisq    fisher
# 1  Gene1 Gene2 1.0000000 1.0000000
# 2  Gene1 Gene3 0.3173105        NA
# 3  Gene1 Gene4 1.0000000 1.0000000
# 4  Gene1 Gene5 0.5049851 0.2500000
# 5  Gene1 Gene6 1.0000000 1.0000000
# 6  Gene1 Gene7 1.0000000 1.0000000
# 7  Gene1 Gene8 0.5049851 0.2500000
# 8  Gene2 Gene3 1.0000000        NA
# 9  Gene2 Gene4 1.0000000 1.0000000
# ...

While combn does have the FUN= argument, I found it easier to reuse the results for the fisher.test automation as well. It also clears up the output (in my mind) since it clearly shows which genes are considered in each comparison.

Upvotes: 2

Related Questions