melbez

Reputation: 1000

Running multiple t.tests to compare pairs of column values in R

I have a dataframe that looks like this:

Age  A1U_sweet  A2F_dip  A3U_bbq  C1U_sweet  C2F_dip  C3U_bbq  Comments
23   1          2        1        NA         NA       NA       Good
54   NA         NA       NA       4          1        2        ABCD
43   2          4        7        NA         NA       NA       HiHi

I am trying to run a series of t.tests to compare columns beginning with A# and the corresponding columns beginning with C#. I have been doing this manually by typing the following for each pair of columns.

t.test(df$A1U_sweet, df$C1U_sweet)

Is there a way for me to run t.tests for A1U and C1U, A2F and C2F, and A3U and C3U without typing each call by hand? I tried using an apply function and also a for loop, but could not figure out how to make either work in this case.

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
                 Age  A1U_sweet  A2F_dip  A3U_bbq  C1U_sweet  C2F_dip  C3U_bbq  Comments
                 23   1          2        1        2          5        5        Good
                 54   1          3        1        4          1        2        ABCD
                 43   2          4        7        1          1        1        HiHi")

Upvotes: 1

Views: 2004

Answers (2)

Ista

Reputation: 10437

The task itself is not difficult or complicated, though it appears that way because of how the data is arranged. When you see variable names that convey more than one piece of information, it is often helpful to ask yourself whether the data can be arranged in a simpler way. This simple claim is at the heart of the popular "tidy" approach to data manipulation in R. While I'm not a fan of everything that has been done in the name of being "tidy", this core claim is sound, and you violate it (as you've done spectacularly here) only at the risk of making your analysis much more difficult than it needs to be.

A good first step is to re-arrange the data so that data is not encoded in the column names:

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
                 Age  A1U_sweet  A2F_dip  A3U_bbq  C1U_sweet  C2F_dip  C3U_bbq  Comments
                 23   1          2        1        2          5        5        Good
                 54   1          3        1        4          1        2        ABCD
                 43   2          4        7        1          1        1        HiHi")

library(tidyr)

df <- data.frame(id = 1:nrow(df), df)

dfl <- gather(df, key = "key", value = "value", -id, -Age, -Comments)
dfl <- separate(dfl, key, into = c("key", "kind", "type"), sep = c(1, 4))
dfl
##    id Age Comments key kind  type value
## 1   1  23     Good   A  1U_ sweet     1
## 2   2  54     ABCD   A  1U_ sweet     1
## 3   3  43     HiHi   A  1U_ sweet     2
## 4   1  23     Good   A  2F_   dip     2
## 5   2  54     ABCD   A  2F_   dip     3
## 6   3  43     HiHi   A  2F_   dip     4
## 7   1  23     Good   A  3U_   bbq     1
## 8   2  54     ABCD   A  3U_   bbq     1
## 9   3  43     HiHi   A  3U_   bbq     7
## 10  1  23     Good   C  1U_ sweet     2
## 11  2  54     ABCD   C  1U_ sweet     4
## 12  3  43     HiHi   C  1U_ sweet     1
## 13  1  23     Good   C  2F_   dip     5
## 14  2  54     ABCD   C  2F_   dip     1
## 15  3  43     HiHi   C  2F_   dip     1
## 16  1  23     Good   C  3U_   bbq     5
## 17  2  54     ABCD   C  3U_   bbq     2
## 18  3  43     HiHi   C  3U_   bbq     1

This might seem like a lot of work, but it makes the data much easier to work with, and not only for this particular operation.
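In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), which can also split the column names in the same step. The following is a minimal sketch of the same reshaping, assuming tidyr >= 1.0.0 is installed and that every measurement column follows the pattern of one letter, a two-character code, an underscore, and a label. The regular expression and the name dfl2 are illustrative choices, not part of the original answer.

```r
library(tidyr)

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Age  A1U_sweet  A2F_dip  A3U_bbq  C1U_sweet  C2F_dip  C3U_bbq  Comments
23   1          2        1        2          5        5        Good
54   1          3        1        4          1        2        ABCD
43   2          4        7        1          1        1        HiHi")

# Row identifier, so each respondent can still be tracked after reshaping
df <- data.frame(id = seq_len(nrow(df)), df)

# One row per (respondent, measurement column); the column name is split
# into three pieces by the capture groups in names_pattern
dfl2 <- pivot_longer(
  df,
  cols          = -c(id, Age, Comments),
  names_to      = c("key", "kind", "type"),
  names_pattern = "^([A-Z])(\\d[A-Z])_(.+)$",
  values_to     = "value"
)
dfl2
```

Because the underscore sits outside the capture groups, kind comes out as 1U, 2F, 3U rather than 1U_, 2F_, 3U_. The columns key, type, and value have the same meaning as in dfl, so the same split() and t.test() steps should apply to dfl2.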

Now that the data has been converted to a sane arrangement the actual task is very simple:

lapply(split(dfl, dfl$type), function(d) t.test(value ~ key, data = d))
## $bbq
## 
##  Welch Two Sample t-test
##  
## data:  value by key
## t = 0.14286, df = 3.2778, p-value = 0.8947
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.748715  7.415381
## sample estimates:
## mean in group A mean in group C 
##        3.000000        2.666667 
##
##
## $dip
## 
##  Welch Two Sample t-test
## 
## data:  value by key
## t = 0.45883, df = 2.7245, p-value = 0.6805
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.233396  5.566729
## sample estimates:
## mean in group A mean in group C 
##        3.000000        2.333333 
## 
## 
## $sweet
## 
##  Welch Two Sample t-test
## 
## data:  value by key
## t = -1.0607, df = 2.56, p-value = 0.3785
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.31437  2.31437
## sample estimates:
## mean in group A mean in group C 
##        1.333333        2.333333 
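
Each element of the list returned by lapply() is an object of class "htest", so the individual statistics can be pulled out by name when a compact table is easier to read than the printed output. A minimal sketch, assuming the long-format columns key, type, and value shown above. The long data is rebuilt by hand here only to keep the example self-contained, and the names tests and res are illustrative.

```r
# Long-format data with the same values as dfl above
dfl <- data.frame(
  key   = rep(c("A", "C"), each = 9),
  type  = rep(rep(c("sweet", "dip", "bbq"), each = 3), times = 2),
  value = c(1, 1, 2,  2, 3, 4,  1, 1, 7,   # A: sweet, dip, bbq
            2, 4, 1,  5, 1, 1,  5, 2, 1)   # C: sweet, dip, bbq
)

# One Welch two-sample t-test per type, comparing key A against key C
tests <- lapply(split(dfl, dfl$type), function(d) t.test(value ~ key, data = d))

# Collect the test statistic, degrees of freedom and p-value in one table
res <- data.frame(
  type      = names(tests),
  t         = sapply(tests, function(x) unname(x$statistic)),
  df        = sapply(tests, function(x) unname(x$parameter)),
  p.value   = sapply(tests, function(x) x$p.value),
  row.names = NULL
)
res
```

The rows of res should match the t, df, and p-value figures in the printed output above, one row per type.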

Upvotes: 1

akrun

Reputation: 887831

If we need to do the t.test on the corresponding '1s', '2s' and '3s' for 'A' and 'C', then split the dataset based on the substring of the column names containing only the digits, and then apply t.test to each pair of columns

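# each split piece holds the A column followed by the C column; pass them as two separate samples
lapply(split.default(df[2:7], gsub("\\D+", "", names(df)[2:7])), function(d) t.test(d[[1]], d[[2]]))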
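
If the column order is not guaranteed, the pairs can also be matched by name rather than by position. The following is a minimal sketch, assuming that every column of interest starts with "A" followed by a digit and has a counterpart whose name differs only in starting with "C". The names a_cols, c_cols, and tests are illustrative.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Age  A1U_sweet  A2F_dip  A3U_bbq  C1U_sweet  C2F_dip  C3U_bbq  Comments
23   1          2        1        2          5        5        Good
54   1          3        1        4          1        2        ABCD
43   2          4        7        1          1        1        HiHi")

# "A" columns followed by a digit (this pattern skips Age), and the
# matching "C" column for each of them
a_cols <- grep("^A[0-9]", names(df), value = TRUE)
c_cols <- sub("^A", "C", a_cols)
stopifnot(all(c_cols %in% names(df)))

# One two-sample t-test per pair; same call as t.test(df$A1U_sweet, df$C1U_sweet)
tests <- Map(function(a, b) t.test(df[[a]], df[[b]]), a_cols, c_cols)

# Label each result by the shared part of the column name
names(tests) <- sub("^A", "", a_cols)
tests[["1U_sweet"]]
```

Map() walks over both name vectors in parallel, so each test compares exactly one A column with its C counterpart. t.test() drops missing values within each sample, so the same code should also run on data where the A and C columns contain NA.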

Upvotes: 1
