Error when using the Benjamini-Hochberg false discovery rate in R after Wilcoxon Rank

Question

I have carried out a Wilcoxon rank sum test to see if there is any significant difference between the expression of 598019 genes between three disease samples vs three control samples. I am in R.

When I see how many genes have a p value < 0.05, I get 41913 altogether. I set the parameters of the Wilcoxon as follows;

wilcox.test(currRow[4:6], currRow[1:3], paired=F, alternative="two.sided", exact=F, correct=F)$p.value

(This is within an apply function, and I can provide my total code if necessary, I was a little unsure as to whether alternative="two.sided" was correct).

However, as I assume correcting for multiple comparisons using the Benjamini Hochberg False Discovery rate would lower this number, I then adjusted the p values via the following code pvaluesadjust1 <- p.adjust(pvalues_genes, method="BH")

Re-assessing which p values are less than 0.05 via the below code, I get 0!

p_thresh1 <- 0.05
names(pvaluesadjust1) <- rownames(gene_analysis1)
output <- names(pvaluesadjust1)[pvaluesadjust1 < p_thresh1]
length(output)

I would be grateful if anybody could please explain, or direct me to somewhere which can help me understand what is going on!

Thank-you (As an extra question, would a t-test be fine due to the size of the data, the Anderson-Darling test showed that the underlying data is not normal. I had far less genes which were less than 0.05 using this statistical test rather than Wilcoxon (around 2000).

StupidWolf · Accepted Answer

Wilcoxon is a parametric test based on ranks. If you have only 6 samples, the best result you can get is rank 2,2,2 in disease versus 5,5,5 in control, or vice-versa.

For example, try the parameters you used in your test, on these random values below, and you that you get the same p.value 0.02534732.

wilcox.test(c(100,100,100),c(1,1,1),exact=F, correct=F)$p.value
wilcox.test(c(5,5,5),c(15,15,15),exact=F, correct=F)$p.value

So yes, with 598019 you can get 41913 < 0.05, these p-values are not low enough and with FDR adjustment, none will ever pass.

You are using the wrong test. To answer your second question, a t.test does not work so well because you don't have enough samples to estimate the standard deviation correctly. Below I show you an example using DESeq2 to find differential genes

library(zebrafishRNASeq)
data(zfGenes)
# remove spikeins
zfGenes = zfGenes[-grep("^ERCC", rownames(zfGenes)),]
head(zfGenes)
                   Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13
ENSDARG00000000001  304  129  339  102    16   617
ENSDARG00000000002  605  637  406   82   230  1245

First three are controls, last three are treatment, like your dataset. To validate what I have said before, you can see that if you do a wilcoxon.test, the minimum value is 0.02534732

all_pvalues = apply(zfGenes,1,function(i){
wilcox.test(i[1:3],i[4:6],exact=F, correct=F)$p.value
})
min(all_pvalues,na.rm=T)
# returns 0.02534732

So we proceed with DESeq2

library(DESeq2)
#create a data.frame to annotate your samples
DF = data.frame(id=colnames(zfGenes),type=rep(c("ctrl","treat"),each=3))
# run DESeq2
dds = DESeqDataSetFromMatrix(zfGenes,DF,~type)
dds = DESeq(dds)
summary(results(dds),alpha=0.05)

out of 25839 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 69, 0.27%
LFC < 0 (down)     : 47, 0.18%
outliers [1]       : 1270, 4.9%
low counts [2]     : 5930, 23%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

So you do get hits which pass the FDR cutoff. Lastly we can pull out list of significant genes

res = results(dds)
res[which(res$padj < 0.05),]

Error when using the Benjamini-Hochberg false discovery rate in R after Wilcoxon Rank

Answers (1)

Related Questions