Tathagato

Reputation: 434

two sample t-test in r

I have a dataframe like this

df <- structure(list(ID = c(243, 292, 317, 388, 398, 404, 463, 473, 
842, 844, 858, 862, 869, 871, 879, 888), Zone = c(1, 1, 1, 1, 
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2), Gen = c("Male", "Male", 
"Other Gender Identity", "Male", "Male", "Male", "Male", "Female", 
"Female", "Male", "Female", "Male", "Male", "Male", "Male", "Female"
), Month_Inc = c("< $1,500", "< $1,500", "< $1,500", "$1,500 - $1,999", 
"$1,500 - $1,999", "< $1,500", "< $1,500", "< $1,500", "$1,500 - $1,999", 
"$2,000 - $2,499", "$1,500 - $1,999", "< $1,500", "$2,500 - $2,999", 
"< $1,500", "< $1,500", "< $1,500")), row.names = c(NA, -16L), class = c("tbl_df", 
"tbl", "data.frame"))

What I need to do is to test if there is a statistical difference for the percentage of females in the two zones. I need to test this for the income level too.

I need to do a t-test for Gen ~ Zone:

H0: %female = %male for the two zones
H1: %female != %male for the two zones

Similarly, for the Month_Inc ~ Zone too!

I tried the following code

t.test(Gen ~ Zone, mu = 0, alt = "two.sided",
       conf=  0.95, paired = FALSE, ver.equal = FALSE, 
       data= df)

however, I am not getting anywhere! How do I correct it? I am thinking of something to do with the data type issue but I can't be certain.

Thanks for your help!

Upvotes: 1

Views: 323

Answers (1)

Thomas Bilach

Reputation: 601

There is a statistical issue here that you're overlooking: Gen is a categorical variable, so a t-test, which compares the means of a numeric variable, does not apply. What you're investigating is a difference in the proportion of females between two zones. With samples this small I would consider Fisher's exact test (fisher.test() in R), which does not rely on a large-sample approximation. The prop.test() function is a simple alternative that runs a chi-squared test of equal proportions. Its first argument is a vector of successes, which here is the number of females within each zone. The next argument is a vector of sample sizes.

# Let's calculate the counts for the different zone-gender pairs

library(dplyr)

df |>
  group_by(Zone, Gen) |>
  summarize(Total = n())

# A tibble: 5 × 3
# Groups:   Zone [2]
   Zone Gen                   Total
  <dbl> <chr>                 <int>
1     1 Female                    1
2     1 Male                      6
3     1 Other Gender Identity     1
4     2 Female                    3
5     2 Male                      5

Since I'm working with a subset of your data, I can look at the counts directly and feed them into the prop.test() function. Here, we see 1 female in zone 1 and 3 females in zone 2.

prop.test(x = c(1, 3), n = c(8, 8), p = NULL, alternative = "two.sided", correct = TRUE)

    2-sample test for equality of proportions with continuity correction

data:  c(1, 3) out of c(8, 8)
X-squared = 0.33333, df = 1, p-value = 0.5637
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.7812791  0.2812791
sample estimates:
prop 1 prop 2 
 0.125  0.375

You may see a warning that the Chi-squared approximation may be incorrect. It appears because the cell counts are very small, which means the p-value and confidence interval above are rough approximations at best. With a subset this small I wouldn't read too much into them.
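Since Fisher's exact test was mentioned at the start, here is a minimal sketch of it using the same counts (1 of 8 female in zone 1, 3 of 8 in zone 2). The object names tab and res are just placeholders, and the "Not female" column lumps together everyone else in the zone, which is an assumption of this sketch.

```r
# 2x2 table of counts read off the summary above.
# Rows are zones; columns are female vs. everyone else (assumed grouping).
tab <- matrix(c(1, 7,
                3, 5),
              nrow = 2, byrow = TRUE,
              dimnames = list(Zone = c("1", "2"),
                              Gen  = c("Female", "Not female")))

# Exact two-sided test of independence between zone and gender
res <- fisher.test(tab, alternative = "two.sided")
res$p.value
```

Because the test is exact, it does not trigger the Chi-squared approximation warning.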

If, on the other hand, you’re interested in whether the population proportions of men and women are not equal, then you can perform this test individually within each respective zone.
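One way to run such a within-zone comparison is an exact binomial test of whether the share of females equals 0.5. This is only a sketch: the counts are copied from the summary table, the names females, males and tests are placeholders, and the single "Other Gender Identity" respondent in zone 1 is left out of the denominator, which is an assumption you may not want to make.

```r
# Counts taken from the summary table. The one "Other Gender Identity"
# respondent in zone 1 is excluded here (an assumption of this sketch).
females <- c(zone1 = 1, zone2 = 3)
males   <- c(zone1 = 6, zone2 = 5)

# Exact binomial test of H0: P(female) = 0.5, run separately in each zone
tests <- lapply(names(females), function(z) {
  binom.test(x = females[[z]], n = females[[z]] + males[[z]], p = 0.5)
})
names(tests) <- names(females)

# Two-sided p-value for each zone
sapply(tests, function(t) t$p.value)
```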

Now, let's talk about individual income. You're supplying R with character values where numeric ones are required. To get something estimable with a standard t-test, we must make a sensible compromise. Say you want to estimate the mean difference in income between two independent groups. Opinions may differ, but using the midpoint of each interval is not uncommon. For example, the midpoint of $1,500 – $1,999 is roughly $1,750. You'd assign the midpoint to each individual observation. Although this is only an approximation, you can then calculate a measure of central tendency such as the mean.
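A rough sketch of the midpoint approach is below. The lookup vector midpoints and the column Inc_mid are names I made up for illustration. The value 750 for "< $1,500" assumes that bracket starts at $0, which your survey may or may not intend. The data frame inc simply repeats the Zone and Month_Inc columns from your question so the snippet runs on its own.

```r
# Zone and Month_Inc copied from the question's data
inc <- data.frame(
  Zone = rep(c(1, 2), each = 8),
  Month_Inc = c("< $1,500", "< $1,500", "< $1,500", "$1,500 - $1,999",
                "$1,500 - $1,999", "< $1,500", "< $1,500", "< $1,500",
                "$1,500 - $1,999", "$2,000 - $2,499", "$1,500 - $1,999",
                "< $1,500", "$2,500 - $2,999", "< $1,500", "< $1,500",
                "< $1,500")
)

# Hypothetical lookup: bracket label -> assumed midpoint in dollars.
# "< $1,500" has no stated lower bound; 750 assumes it starts at 0.
midpoints <- c("< $1,500"        = 750,
               "$1,500 - $1,999" = 1750,
               "$2,000 - $2,499" = 2250,
               "$2,500 - $2,999" = 2750)

# Replace each label with its midpoint
inc$Inc_mid <- unname(midpoints[inc$Month_Inc])

# Welch two-sample t-test on the approximated incomes
res_inc <- t.test(Inc_mid ~ Zone, data = inc, var.equal = FALSE)
res_inc
```

With only eight observations per zone and heavily skewed brackets, treat the result as a rough indication rather than a firm conclusion.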

Upvotes: 2
