Wilcoxon Test in a loop for a number of datasets at the same time

Question

I have a question about whether I can do a Wilcoxon test in a loop for all the table generated.

Basically, I want to do a paired Wilcoxon test between 2 variables for each dataset, and the 2 variables are in the same position(like xth and yth column) for every dataset. (For people who are familiar with Biology, in fact this is the RPKM values for like between control and treated sample for some repetitive elements) And I hope I can generate a table for the p-value from Wilcoxon test for each dataset.

I ready generated all the tables/dataset/dataframe using the below code and I think I want to do a Wilcoxon test for each dataset so I think I need to continue with the loop but i don't know how to do it:

data=sample_vs_norm
filter=unique(data$family)

for(i in 1:length(filter)){
  table_name=paste('table_', filter[i], sep="")
  print(table_name)
  assign(table_name, data[data$Subfamily == filter[i]])

here is the structure of a single dataset: so basically i would like to do a Wilcoxon test between the variables "R009_initial_filter_rpkm" and "normal_filter_rpkm"

 Chr     Start       End Mappability Strand R009_initial_filter_NormalizedCounts
1: chr11 113086868 113087173           1      -                                        2
2:  chr2  24290845  24291132           1      -                                       11
3:  chr4  15854425  15854650           1      -                                        0
4:  chr6  43489623  43489676           1      +                                       11
   normal_filter_NormalizedCounts R009_initial_filter_rpkm
1:                                         14.569000                     0.169752
2:                                          1.000000                     0.992191
3:                                         14.815900                     0.000000
4:                                          0.864262                     5.372810
   normal_filter_rpkm FoldChange     p.value         FDR FoldChangeFPKM
1:                              1.236560   0.137278 0.999862671 1.000000000      0.1372776
2:                              0.000000  11.000000 0.003173828 0.008149271            Inf
3:                              1.704630   0.000000 1.000000000 1.000000000      0.0000000
4:                              0.422137  12.727600 0.003173828 0.008149271     12.7276453

structure(list(Chr = structure(1:4, .Label = c("chr11", "chr2", 
"chr4", "chr6"), class = "factor"), Start = c(113086868L, 24290845L, 
15854425L, 43489623L), End = c(113087173L, 24291132L, 15854650L, 
43489676L), Mappability = c(1L, 1L, 1L, 1L), Strand = structure(c(1L, 
1L, 1L, 2L), .Label = c("-", "+"), class = "factor"), R009_initial_filter_NormalizedCounts = c(2L, 
11L, 0L, 11L), normal_filter_NormalizedCounts = c(14.569, 
1, 14.8159, 0.864262), R009_initial_filter_rpkm = c(0.169752, 
0.992191, 0, 5.37281), normal_filter_rpkm = c(1.23656, 
0, 1.70463, 0.422137), FoldChange = c(0.137278, 11, 0, 12.7276
), p.value = c(0.999862671, 0.003173828, 1, 0.003173828), FDR = c(1, 
0.008149271, 1, 0.008149271), FoldChangeFPKM = c(0.1372776, Inf, 
0, 12.7276453), class = "data.frame", row.names = c(NA, 
-4L))

I'm sorry if I use incorrect terminology as I am a newbie in R, and thank you so much for the help

Ian Campbell · Accepted Answer

One approach is to use grouping with by = in data.table.

library(data.table)
setDT(data)
data[,wilcox.test(R009_initial_filter_rpkm,
                  normal_filter_rpkm)[c("statistic","p.value")],
     by = TE_Subfamily]
#   TE_Subfamily statistic p.value
#1:       AluYf4       7.5       1

You can group by any number of variables, for example TE_Subfamily and Chr:

data[TE_Subfamily %in% filter,
     wilcox.test(R009_initial_filter_rpkm,
                  normal_filter_rpkm)[c("statistic","p.value")],
     by = .(TE_Subfamily,Chr)]
#   TE_Subfamily   Chr statistic p.value
#1:       AluYf4 chr11         0       1
#2:       AluYf4  chr2         1       1
#3:       AluYf4  chr4         0       1
#4:       AluYf4  chr6         1       1

If you need to only perform comparisons for certain TE_Subfamily, you could try something like this:

filter <- c("AluYf4")
data[TE_Subfamily %in% filter,
     wilcox.test(R009_initial_filter_rpkm,
                  normal_filter_rpkm)[c("statistic","p.value")],
     by = TE_Subfamily]
#   TE_Subfamily statistic p.value
#1:       AluYf4       7.5       1

For bonus points, you can correct for multiple testing:

data[TE_Subfamily %in% filter,
     wilcox.test(R009_initial_filter_rpkm,
                  normal_filter_rpkm)[c("statistic","p.value")],
     by = TE_Subfamily][,adjusted.p.value := p.adjust(p.value,method = "bonferroni")][]

Wilcoxon Test in a loop for a number of datasets at the same time

Answers (1)

Related Questions