Joe 5

Reputation: 19

Error message when running a T-Test in R due to character variable

I have been trying to run a two-sided t-test in R but keep running into an error. Below are my process flow, dataset details, and script from RStudio. I used a dataset called LungCapacity that I downloaded from this website: https://www.statslectures.com/r-scripts-datasets.

#Imported data set into RStudio.

# Ran a summary report to see the data and class.
summary(LungCapData)

# Here I could see that the smoke column is a character, so I converted it to a factor
LungCapData$Smoke <- factor(LungCapData$Smoke)

# On checking the summary, I see it's converted to a factor with a yes and no.

# I want to run a t-test between lung capacity and smoking. 
t.test(LungCapData$LungCap, LungCapData$Smoke, alternative = c("two.sided"), mu=0, var.equal = FALSE, conf.level = 0.95, paired = FALSE)

Now on running this I get the following error.

Error in var(y) : Calling var(x) on a factor x is defunct.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning message:
In mean.default(y) : argument is not numeric or logical: returning NA

I have tried converting the Smoke variable from yes and no to 1 and 0. The code then runs, but the result is not correct. What am I doing wrong?

Upvotes: 0

Views: 2767

Answers (2)

Ian Campbell

Reputation: 24848

You're very close; you just need to call t.test with a formula:

LungCapacityData <- read.table(
  "https://docs.google.com/uc?id=0BxQfpNgXuWoITmVwQzJ2VF9qVlU&export=download",
  header = TRUE)

t.test(LungCap ~ Smoke, data = LungCapacityData,
       alternative = c("two.sided"), mu=0, var.equal = FALSE,
       conf.level = 0.95, paired = FALSE)

#   Welch Two Sample t-test
#
#data:  LungCap by Smoke
#t = -3.6498, df = 117.72, p-value = 0.0003927
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -1.3501778 -0.4003548
#sample estimates:
# mean in group no mean in group yes 
#         7.770188          8.645455 

With your current approach, you're trying to compare LungCapacityData$LungCap, which is a numeric vector:

LungCapacityData$LungCap[1:10]
# [1]  6.475 10.125  9.550 11.125  4.800  6.225  4.950  7.325  8.875  6.800

with LungCapacityData$Smoke, which is a factor:

LungCapacityData$Smoke[1:10]
# [1] no  yes no  no  no  no  no  no  no  no 

Instead, you want to instruct t.test to compare LungCapacityData$LungCap when grouping by LungCapacityData$Smoke. That is achieved with a formula.

The formula LungCap ~ Smoke says that LungCap should depend on Smoke. When you use a formula with bare column names, you also supply data = so that t.test knows which data frame to look them up in.

When you try to convert LungCapacityData$Smoke to numeric, you get the wrong result because you're just getting the factor level indices, which have no biological significance.

as.numeric(LungCapacityData$Smoke)[1:10]
# [1] 1 2 1 1 1 1 1 1 1 1

You're basically asking whether the mean of those arbitrary level indices is different from the mean of lung capacity.
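To see this on a small scale, here is a minimal sketch on simulated data. The toy data frame and its values are made up for illustration and are not the LungCapData file. It shows that converting the factor yields the level codes 1 and 2, and that the two-vector call then reports the mean of those codes instead of a second group mean.

```r
# Hypothetical toy data, NOT the real LungCapData file
set.seed(1)
toy <- data.frame(
  LungCap = c(rnorm(20, mean = 8), rnorm(20, mean = 9)),
  Smoke   = factor(rep(c("no", "yes"), each = 20))
)

# as.numeric() on a factor returns the internal level codes
codes <- as.numeric(toy$Smoke)
table(codes)

# Two-vector call: compares mean(LungCap) with mean(codes)
wrong <- t.test(toy$LungCap, codes)
wrong$estimate

# Formula call: compares mean LungCap of non-smokers with that of smokers
right <- t.test(LungCap ~ Smoke, data = toy)
right$estimate
```

The second element of wrong$estimate is just the average of the 1s and 2s, so the test answers a question nobody meant to ask.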

The other way is to subset LungCapacityData$LungCap yourself, but that's a lot more typing:

t.test(LungCapacityData$LungCap[LungCapacityData$Smoke == "yes"],
       LungCapacityData$LungCap[LungCapacityData$Smoke == "no"],
       alternative = c("two.sided"), mu=0, var.equal = FALSE,
       conf.level = 0.95, paired = FALSE)
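As a quick sanity check that the formula call and the manual subsetting are the same test, here is a small sketch on simulated data (the toy data frame is hypothetical, not the real file). Note that the formula interface orders the groups by factor level, so it computes "no" minus "yes"; passing the "yes" subset first, as in the call above, flips the sign of t and of the confidence interval but leaves the p-value unchanged.

```r
# Hypothetical toy data, NOT the real LungCapData file
set.seed(42)
toy <- data.frame(
  LungCap = c(rnorm(25, mean = 8), rnorm(15, mean = 9)),
  Smoke   = factor(rep(c("no", "yes"), times = c(25, 15)))
)

# Formula interface: groups are ordered by factor level ("no", then "yes")
by_formula <- t.test(LungCap ~ Smoke, data = toy)

# Manual subsetting, same group order as the formula call
by_subset <- t.test(toy$LungCap[toy$Smoke == "no"],
                    toy$LungCap[toy$Smoke == "yes"])

# Manual subsetting with the groups swapped
swapped <- t.test(toy$LungCap[toy$Smoke == "yes"],
                  toy$LungCap[toy$Smoke == "no"])

all.equal(unname(by_formula$statistic), unname(by_subset$statistic))
all.equal(by_formula$p.value, by_subset$p.value)
all.equal(unname(by_formula$statistic), -unname(swapped$statistic))
```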

Upvotes: 3

Len Greski

Reputation: 10865

As called in the OP, t.test() attempts to compare the means of two vectors, so the function expects both of them to be numeric.

Instead, use the formula version of t.test(). With this method, t.test() uses the column on the right side of the ~ as the grouping variable, and the column on the left side of the ~ as the numeric variable whose means are compared between the two groups defined by the grouping variable.

data <- read.table(file = "./data/LungCapData.txt", header = TRUE)
t.test(LungCap ~ Smoke, data = data)

...and the output:

> t.test(LungCap ~ Smoke,data = data)

    Welch Two Sample t-test

data:  LungCap by Smoke
t = -3.6498, df = 117.72, p-value = 0.0003927
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.3501778 -0.4003548
sample estimates:
 mean in group no mean in group yes 
         7.770188          8.645455 

> 
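One caveat on "the two groups": the formula version only works when the grouping variable has exactly two levels. The sketch below uses a made-up data frame with a hypothetical three-level column Group3 (nothing like it exists in LungCapData) to show the error, and one possible way to restrict the data to two groups before testing.

```r
# Hypothetical toy data with a three-level grouping column
set.seed(2)
toy <- data.frame(
  LungCap = rnorm(30, mean = 8),
  Group3  = factor(rep(c("never", "former", "current"), each = 10))
)
nlevels(toy$Group3)

# With three levels the formula call stops with an error; capture it
res <- tryCatch(t.test(LungCap ~ Group3, data = toy),
                error = function(e) e)
conditionMessage(res)

# Keeping only two of the groups makes the formula call valid again
two <- droplevels(subset(toy, Group3 != "former"))
ok <- t.test(LungCap ~ Group3, data = two)
ok$p.value
```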

Upvotes: 0
