I am running a chi-squared test for independence on a data.frame called Comp1 with two binary variables and 13109 observations.
I am using the test before clustering consumers based on demographics. If the two variables depend on one another, then certain values will tend to fall in the same cluster. The two variables are a subset of another data.frame with 36 variables.
I got an error saying the data.frame has character variables, even though the str() function shows them as factors.
Why does the error say the data.frame has character values?
data:
> str(Comp1)
'data.frame': 13109 obs. of 2 variables:
$ HomeOwnerStatus: Factor w/ 2 levels "Own","Rent": 1 2 2 2 1 2 1 1 2 2 ...
$ MaritalStatus : Factor w/ 2 levels "Married","Single": 2 1 1 1 2 1 2 1 1 1 ...
example:
> #Create dataset
> homeownerstatus <- c("Own", "Rent", "Own", "Own", "Rent", "Own")
> maritalstatus <- c("Married", "Married", "Married", "Single", "Single", "Married")
> Comp1 <- data.frame(homeownerstatus, maritalstatus)
error with solution:
> #Test binary variables for independence
> #Create matrix from data.frame
> DF4 <- as.matrix(Comp1)
> #Comparison of marital status and home owner status
> #Perform chi-squared test for independence of two variables
> chisq.test(table(Comp1))
Chi-squared test for given probabilities
data: table(DF4)
X-squared = 295149.5, df = 71, p-value < 2.2e-16
Reputation: 19454
chisq.test wants either a factor vector for both its x and y arguments, or a matrix or data.frame for the x argument. When a data.frame is passed, the function converts it to a matrix with as.matrix. This step coerces the factor columns in your data.frame to character.
> as.matrix(Comp1)
     homeownerstatus maritalstatus
[1,] "Own"           "Married"
[2,] "Rent"          "Married"
[3,] "Own"           "Married"
[4,] "Own"           "Single"
[5,] "Rent"          "Single"
[6,] "Own"           "Married"
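To check the coercion yourself, here is a minimal sketch using the toy data from the question. The names col_classes, m, matrix_mode, pooled and pooled_dims are made up for illustration, and stringsAsFactors = TRUE is an assumption added so the columns are factors on R 4.0 and later, where data.frame no longer converts strings by default.

```r
# Toy data mirroring the question's example. stringsAsFactors = TRUE is set
# explicitly (an assumption) so the columns are factors on any R version.
Comp1 <- data.frame(
  homeownerstatus = c("Own", "Rent", "Own", "Own", "Rent", "Own"),
  maritalstatus   = c("Married", "Married", "Married", "Single", "Single", "Married"),
  stringsAsFactors = TRUE
)

# Inside the data.frame both columns are factors ...
col_classes <- sapply(Comp1, class)

# ... but as.matrix() on non-numeric columns returns a character matrix.
m <- as.matrix(Comp1)
matrix_mode <- typeof(m)

# table() on that matrix pools all cells into a single 1-D table
# instead of cross-tabulating the two variables.
pooled <- table(m)
pooled_dims <- length(dim(pooled))

print(col_classes)
print(matrix_mode)
print(pooled)
```

Because the pooled table is one-dimensional, passing it to chisq.test runs a goodness-of-fit test rather than a test of independence, which would be consistent with the "Chi-squared test for given probabilities" header in the question's output.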
So, my suggestion would be to pass two factor vectors:
chisq.test(Comp1$homeownerstatus, Comp1$maritalstatus)
Pearson's Chi-squared test with Yates' continuity correction
data: Comp1$homeownerstatus and Comp1$maritalstatus
X-squared = 0, df = 1, p-value = 1
Warning message:
In chisq.test(Comp1$homeownerstatus, Comp1$maritalstatus) :
Chi-squared approximation may be incorrect
EDIT
When you pass a matrix or a data.frame to the x argument, that object is taken to be a contingency table, which is not what you want here. You have two binary variables whose contingency table should be calculated and then tested with the chi-squared test. Therefore you should pass each factor vector as described above or, alternatively, calculate the contingency table and pass that to chisq.test.
chisq.test(table(Comp1))
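As a rough check that the two routes agree, the sketch below builds the contingency table from the question's toy data and compares both calls. The names tab, res_table, res_vectors and same_statistic are hypothetical, stringsAsFactors = TRUE is an assumption for R 4.0 and later, and suppressWarnings is used only to hide the small-sample warning that six rows trigger.

```r
# Toy data from the question; stringsAsFactors = TRUE is an assumption so the
# columns are factors regardless of the R version.
Comp1 <- data.frame(
  homeownerstatus = c("Own", "Rent", "Own", "Own", "Rent", "Own"),
  maritalstatus   = c("Married", "Married", "Married", "Single", "Single", "Married"),
  stringsAsFactors = TRUE
)

# table() on a data.frame cross-tabulates its columns: a 2 x 2 contingency table.
tab <- table(Comp1)

# Route 1: pass the contingency table.
res_table <- suppressWarnings(chisq.test(tab))

# Route 2: pass the two factor vectors directly.
res_vectors <- suppressWarnings(
  chisq.test(Comp1$homeownerstatus, Comp1$maritalstatus)
)

# Both routes should produce the same test statistic and degrees of freedom.
same_statistic <- isTRUE(all.equal(unname(res_table$statistic),
                                   unname(res_vectors$statistic)))

print(tab)
print(same_statistic)
```

With only six observations the expected cell counts are small, so the result on this toy data is illustrative only.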