ery

Reputation: 992

R: t-test over all subsets over all columns

This is a follow-up question to "R: t-test over all columns".

Suppose I have a huge data set from which I have created numerous subsets based on certain conditions. All subsets have the same number of columns. I want to run a t-test on two subsets at a time (outer loop) and, for each combination of subsets, go through all columns one at a time (inner loop).

Here is what I have come up with based on the previous answer. It stops with an error.

C <- c("c1","c1","c1","c1","c1",
   "c2","c2","c2","c2","c2",
   "c3","c3","c3","c3","c3",
   "c4","c4","c4","c4","c4",
   "c5","c5","c5","c5","c5",
   "c6","c6","c6","c6","c6",
   "c7","c7","c7","c7","c7",
   "c8","c8","c8","c8","c8",
   "c9","c9","c9","c9","c9",
   "c10","c10","c10","c10","c10")
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
Data <- data.frame(C, X, Y, Z)

Data.c1 = subset(Data, C == "c1",select=X:Z)
Data.c2 = subset(Data, C == "c2",select=X:Z)
Data.c3 = subset(Data, C == "c3",select=X:Z)
Data.c4 = subset(Data, C == "c4",select=X:Z)
Data.c5 = subset(Data, C == "c5",select=X:Z)

Data.Subsets = c("Data.c1",
                 "Data.c2",
                 "Data.c3",
                 "Data.c4",
                 "Data.c5") 

library(plyr)

combo1 <- combn(length(Data.Subsets),1)
adply(combo1, 1, function(x) {

  combo2 <- combn(ncol(Data.Subsets[x]),2)
  adply(combo2, 2, function(y) {

      test <- t.test( Data.Subsets[x][, y[1]], Data.Subsets[x][, y[2]], na.rm=TRUE)

      out <- data.frame("Subset" = rownames(Data.Subsets[x]),
                    , "Row" = colnames(x)[y[1]]
                    , "Column" = colnames(x[y[2]])
                    , "t.value" = round(test$statistic,3)
                    ,  "df"= test$parameter
                    ,  "p.value" = round(test$p.value, 3)
                    )
      return(out)
  } )
} )

Upvotes: 1

Views: 3988

Answers (2)

Davy Kavanagh

Reputation: 4939

You can use get(Data.Subsets[x]), which will pick out the relevant data frame from its name. But I don't think this should be necessary.

Explicitly subsetting that many times shouldn't be necessary either. You could create the subsets using something like

conditions = c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x){subset(Data, C==x, select=X:Z)})

That should (I didn't test it) return a list of data frames, each subsetted on one of the conditions you passed it.
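As a rough sketch of how such a list could drive the nested loop from the question: the outer loop runs over every pair of subsets and the inner loop over every column. The data are rebuilt here so the block runs on its own; the seed, the names `pairs` and `results`, and the output column names are assumptions made for illustration, not part of the original code.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
Data <- data.frame(
  C = rep(paste0("c", 1:10), each = 5),
  X = rnorm(50, mean = 10, sd = 5),
  Y = rnorm(50, mean = 15, sd = 6),
  Z = rnorm(50, mean = 20, sd = 5)
)

conditions <- c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x) subset(Data, C == x, select = X:Z))
names(dfs) <- conditions  # label each subset by its condition

# Outer loop: every pair of subsets. Inner loop: every column.
pairs <- combn(conditions, 2, simplify = FALSE)
results <- do.call(rbind, lapply(pairs, function(p) {
  do.call(rbind, lapply(names(dfs[[1]]), function(col) {
    test <- t.test(dfs[[p[1]]][[col]], dfs[[p[2]]][[col]])
    data.frame(
      Subset1 = p[1], Subset2 = p[2], Column = col,
      t.value = round(unname(test$statistic), 3),
      df      = unname(test$parameter),
      p.value = round(test$p.value, 3)
    )
  }))
}))
head(results)
```

The p-values in `results` are unadjusted, so the multiple-testing caveat below still applies.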

However, it would be a much better idea, as @Richie Cotton points out, to reshape your data frame and use pairwise t tests.

I should point out that doing this many t-tests doesn't seem wise, even after correction for multiple testing, be it FDR, permutation or otherwise. It would be better to figure out whether you can use an ANOVA of some sort, as ANOVAs are used for almost exactly this purpose.
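For illustration only, a one-way ANOVA on a single column might look like the sketch below. It assumes the same simulated layout as the question (10 groups of 5 rows) and tests only `X`; the seed and the name `anova_table` are choices made for this example, and whether a one-way model suits the real data is a separate question.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
Data <- data.frame(
  C = gl(10, 5, labels = paste0("c", 1:10)),
  X = rnorm(50, mean = 10, sd = 5)
)

# One test of "does the mean of X differ across the levels of C?"
# instead of 45 separate pairwise t-tests.
fit <- aov(X ~ C, data = Data)
anova_table <- summary(fit)[[1]]
print(anova_table)
```

If the overall test suggests differences, `TukeyHSD(fit)` is one standard way to follow up on individual pairs.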

Upvotes: 1

Richie Cotton

Reputation: 121077

First of all, you can define your dataset more easily by using gl and by avoiding creating individual variables for the columns.

Data <- data.frame(
  C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
  X = rnorm(n = 50, mean = 10, sd = 5),
  Y = rnorm(n = 50, mean = 15, sd = 6),
  Z = rnorm(n = 50, mean = 20, sd = 5)
)

Convert this to "long" format using melt from the reshape package. (You can also use the base reshape function.)

library(reshape)  # provides melt()
longData <- melt(Data, id.vars = "C")
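A sketch of the base `reshape` alternative mentioned above, for readers who prefer not to load a package. The data are rebuilt so the block is self-contained; the seed and the helper column name `id` are assumptions for this example (`reshape` creates the `id` column because it does not already exist in `Data`).

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
Data <- data.frame(
  C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
  X = rnorm(n = 50, mean = 10, sd = 5),
  Y = rnorm(n = 50, mean = 15, sd = 6),
  Z = rnorm(n = 50, mean = 20, sd = 5)
)

# Stack X, Y and Z into one "value" column, with "variable" recording
# which original column each value came from.
longData <- reshape(
  Data,
  direction = "long",
  varying   = c("X", "Y", "Z"),
  v.names   = "value",
  timevar   = "variable",
  times     = c("X", "Y", "Z"),
  idvar     = "id"
)
str(longData)
```

Here `variable` comes out as a character column rather than a factor, which `interaction` accepts as-is.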

Now use pairwise.t.test to compute t tests between all pairs of groups, where each group is one combination of a level of C with X, Y or Z.

with(longData, pairwise.t.test(value, interaction(C, variable)))

Note that it is important to use pairwise.t.test rather than just lots of individual calls to t.test because you need to adjust your p values if you run lots of tests. (See, e.g., xkcd for explanation.)
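To make the adjustment visible, the sketch below runs `pairwise.t.test` twice on one column, once with `p.adjust.method = "none"` and once with `"holm"` (the function's default), and compares the results. The simulated data, the seed, and the names `p_raw` and `p_holm` are assumptions for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
Data <- data.frame(
  C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
  X = rnorm(n = 50, mean = 10, sd = 5)
)

# Unadjusted p-values: what many separate tests would report.
p_raw <- with(Data, pairwise.t.test(X, C, p.adjust.method = "none"))$p.value

# Holm-adjusted p-values: the default for pairwise.t.test.
p_holm <- with(Data, pairwise.t.test(X, C, p.adjust.method = "holm"))$p.value

# Adjustment can only leave a p-value unchanged or make it larger.
all(p_holm >= p_raw, na.rm = TRUE)

# Number of "significant" pairs at 0.05, before and after adjustment.
c(raw = sum(p_raw < 0.05, na.rm = TRUE),
  holm = sum(p_holm < 0.05, na.rm = TRUE))
```

Both results are lower-triangular matrices with one entry per pair of groups; the upper triangle is `NA`, hence `na.rm = TRUE`.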

In general, pairwise t tests are inferior to a regression, so be careful about how you use them.

Upvotes: 6
