Sarah
Sarah

Reputation: 21

ANOVA within subsets of data

My dataset consists of gene expression values recorded over 3 timepoints I am trying to apply a anova test with tukey correction to look for differential expression of a gene across the timepoints. So for each gene i want a comparison like: gene a timepoint 1 vs 2 gene a timepoint 2 vs 3 gene a timepoint 3 vs 1

My data is in the below format:

    > head(rf)
                gene      expn timepoint rep
2  EG620009  // EG620009  8.428851      x0hr   0
3                 LYPLA1 10.386500      x0hr   0
21 EG620009  // EG620009  8.582346      x0hr   1
31                LYPLA1 10.379710      x0hr   1
22 EG620009  // EG620009  8.566248      x0hr   2
32                LYPLA1 10.399080      x0hr   2
    > tail(rf)
                gene      expn timepoint rep
23 EG620009  // EG620009  8.561409     x24hr   0
33                LYPLA1 10.233400     x24hr   0
24 EG620009  // EG620009  8.750639     x24hr   1
34                LYPLA1 10.023780     x24hr   1
25 EG620009  // EG620009  8.560267     x24hr   2
35                LYPLA1 10.025980     x24hr   2

If I were to do:

TukeyHSD(aov(rf$expn ~ rf$timepoint * rf$gene))

this would give me a comparison between every timepoint across all of the genes ie. including comparisons such as gene a timepoint 1 vs gene b timepoint 2

I have been trying to work out how to apply the aov function to the data subsetting by gene. I have defined a function that gives the p value as output and tried to apply that to each gene individually using the by function as below;

> gene.aov = function(x) {TukeyHSD(aov(expn ~ timepoint, data = x))}
> aov.pval = function(y) {y$timepoint[,4]}
> gene.pval = function(z) {aov.pval(gene.aov(z))}
> pvals = by(rf$expn,list(rf$gene),gene.pval)
> Error in eval(predvars, data, env) : 
   numeric 'envir' arg not of length one 

Any hint why this doesn't work? Or should i be approaching this problem in a completely different way? Thanks!

Upvotes: 2

Views: 1650

Answers (1)

Nate
Nate

Reputation: 10671

it's not working because by expects it's first argument to be a data.frame or matrix, you are passing is rf$exp which is a numeric vector. You could do this and it will work fine (I abandoned the multiple functions for readability).

by(rf, rf$gene, function(x) {TukeyHSD(aov(expn ~ timepoint, data = x))}, simplify = F)

rf$gene: EG620009
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = expn ~ timepoint, data = x)

$timepoint
              diff       lwr      upr     p adj
x24hr-x0hr 0.09829 -0.123391 0.319971 0.2857424

--------------------------------------------------------------------------- 
rf$gene: LYPLA1
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = expn ~ timepoint, data = x)

$timepoint
                 diff        lwr       upr     p adj
x24hr-x0hr -0.2940433 -0.4876756 -0.100411 0.0135193

Upvotes: 1

Related Questions