Reputation: 19
I have been trying to run a two-sided t-test in R but keep running into an error. Below are my process, dataset details, and script from RStudio. I used a dataset called LungCapacity that I downloaded from this website: https://www.statslectures.com/r-scripts-datasets.
#Imported data set into RStudio.
# Ran a summary report to see the data and class.
summary(LungCapData)
# Here I could see that the smoke column is a character, so I converted it to a factor
LungCapData$Smoke <- factor(LungCapData$Smoke)
# On checking the summary, I see it's converted to a factor with levels yes and no.
# I want to run a t-test between lung capacity and smoking.
t.test(LungCapData$LungCap, LungCapData$Smoke, alternative = c("two.sided"), mu=0, var.equal = FALSE, conf.level = 0.95, paired = FALSE)
Now on running this I get the following error.
Error in var(y) : Calling var(x) on a factor x is defunct.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning message:
In mean.default(y) : argument is not numeric or logical: returning NA
I have tried converting the Smoke variable from Yes and No to 1 and 0. The code then runs, but the result is not correct. What am I doing wrong?
Upvotes: 0
Views: 2767
Reputation: 24848
You're very close; you just need to call t.test with a formula:
LungCapacityData <- read.table(
"https://docs.google.com/uc?id=0BxQfpNgXuWoITmVwQzJ2VF9qVlU&export=download",
header = TRUE)
t.test(LungCap ~ Smoke, data = LungCapacityData,
alternative = c("two.sided"), mu=0, var.equal = FALSE,
conf.level = 0.95, paired = FALSE)
# Welch Two Sample t-test
#
#data: LungCap by Smoke
#t = -3.6498, df = 117.72, p-value = 0.0003927
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -1.3501778 -0.4003548
#sample estimates:
# mean in group no mean in group yes
# 7.770188 8.645455
With your current approach, you're trying to compare LungCapacityData$LungCap, which is a numeric vector:
LungCapacityData$LungCap[1:10]
# [1] 6.475 10.125 9.550 11.125 4.800 6.225 4.950 7.325 8.875 6.800
with LungCapacityData$Smoke, which is a factor:
LungCapacityData$Smoke[1:10]
# [1] no yes no no no no no no no no
Instead, you want to instruct t.test to compare LungCapacityData$LungCap when grouping by LungCapacityData$Smoke. That is achieved with a formula.
The formula LungCap ~ Smoke says that LungCap should depend on Smoke. When you use a formula with bare column names, you also supply data = so that t.test knows which data frame to look them up in.
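To see the formula interface on its own, here is a minimal sketch using a small made-up data frame. The name toy and all of its values are invented for illustration and are not taken from the lung capacity dataset.

```r
# Hypothetical toy data: a numeric response and a two-level grouping factor
toy <- data.frame(
  LungCap = c(6.5, 7.1, 8.0, 7.4, 9.2, 8.8, 9.5, 8.1),
  Smoke   = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes"))
)

# response ~ group: compare mean LungCap between the levels of Smoke
res <- t.test(LungCap ~ Smoke, data = toy)

# Group means are reported in factor-level order ("no" first, then "yes")
res$estimate
```

The left side of ~ must be numeric and the right side must have exactly two groups for a two-sample test.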
When you try to convert LungCapacityData$Smoke to numeric, you get the wrong result because you're just getting the factor's internal level codes, which have no biological significance.
as.numeric(LungCapacityData$Smoke)[1:10]
# [1] 1 2 1 1 1 1 1 1 1 1
You're basically asking whether the mean of those arbitrary level codes is different from the mean lung capacity, which is not a meaningful comparison.
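A small sketch of what that conversion does, using a made-up three-element factor (the vector smoke is hypothetical): as.numeric() returns the internal level codes, while an explicit comparison gives a 0/1 indicator if one is really needed elsewhere.

```r
smoke <- factor(c("no", "yes", "no"))  # hypothetical example values

levels(smoke)        # levels are sorted alphabetically: "no" "yes"
as.numeric(smoke)    # internal codes: 1 for "no", 2 for "yes"

# An explicit 0/1 indicator, if one is needed for some other purpose
smoke01 <- as.integer(smoke == "yes")
smoke01
```

Either way, the indicator is a group label rather than a measurement, so it should not be passed to t.test() as the second sample.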
The other way is to subset LungCapacityData$LungCap yourself, but that's a lot more typing:
t.test(LungCapacityData$LungCap[LungCapacityData$Smoke == "yes"],
LungCapacityData$LungCap[LungCapacityData$Smoke == "no"],
alternative = c("two.sided"), mu=0, var.equal = FALSE,
conf.level = 0.95, paired = FALSE)
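As a quick check that the two call styles agree, here is a sketch on a made-up data frame (toy and its values are invented). The formula version takes the first factor level ("no") as the first group, so passing the "yes" subset first flips the sign of the t statistic while leaving the p-value unchanged.

```r
# Hypothetical toy data, not the real lung capacity dataset
toy <- data.frame(
  LungCap = c(6.5, 7.1, 8.0, 7.4, 9.2, 8.8, 9.5, 8.1),
  Smoke   = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes"))
)

by_formula <- t.test(LungCap ~ Smoke, data = toy)
by_subset  <- t.test(toy$LungCap[toy$Smoke == "yes"],
                     toy$LungCap[toy$Smoke == "no"])

# Same p-value; the statistic differs only in sign
all.equal(by_formula$p.value, by_subset$p.value)
all.equal(unname(by_formula$statistic), -unname(by_subset$statistic))
```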
Upvotes: 3
Reputation: 10865
As called in the question, t.test() attempts to compare the means of two vectors, so it expects both of them to be numeric.
Instead, use the formula version of t.test(). With this method, t.test() uses the column on the right side of the ~ as the grouping variable, and the column on the left side of the ~ as the numeric variable whose means are compared across the two groups defined by the grouping variable.
data <- read.table(file = "./data/LungCapData.txt",header = TRUE)
t.test(LungCap ~ Smoke,data = data)
...and the output:
> t.test(LungCap ~ Smoke,data = data)
Welch Two Sample t-test
data: LungCap by Smoke
t = -3.6498, df = 117.72, p-value = 0.0003927
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.3501778 -0.4003548
sample estimates:
mean in group no mean in group yes
7.770188 8.645455
>
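If the numbers are needed in later code rather than just printed, the object returned by t.test() can be stored and its components read directly. A minimal sketch on invented data (the data frame toy2 and its values are hypothetical):

```r
# Hypothetical toy data, not the real lung capacity dataset
toy2 <- data.frame(
  LungCap = c(5.9, 6.8, 7.7, 7.2, 6.4, 8.9, 9.4, 8.3, 9.0, 8.6),
  Smoke   = factor(rep(c("no", "yes"), each = 5))
)

res2 <- t.test(LungCap ~ Smoke, data = toy2)

res2$statistic   # t statistic
res2$parameter   # degrees of freedom (Welch approximation)
res2$p.value     # two-sided p-value
res2$conf.int    # confidence interval for the difference in means
```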
Upvotes: 0