Reputation: 11
While running an ANOVA on a subset of my dataset, I get this error:
Error in model.frame.default(formula = ready$GDPpercapita[ready$cluster == : variable lengths differ (found for 'ready$GDPpercapita[ready$cluster == 3]')
Following is my code:
for(a in 1:6){
  for(b in a+1:6){
    result=paste("GDPpercapita CLusters ",a,"&",b)
    print(result)
    first<-subset(ready,ready$cluster==a)
    second<-subset(ready,ready$cluster==b)
    x<-summary(aov(first$GDPpercapita~second$GDPpercapita))
    print(x)
  }
}
And here is a glimpse of my data:
The error is not caused by the loop or by creating the subsets, as the following code returns the same error:
x<-summary(aov(ready$GDPpercapita[ready$cluster==1]~ready$GDPpercapita[ready$cluster==2]))
print(x)
The column cluster is a factor variable. My objective is to run an ANOVA for every variable (e.g. GDPpercapita) for all pairs of clusters.
Any help will be appreciated.
Upvotes: 1
Views: 3250
Reputation: 323
The problem is you're attempting to use ANOVA like a t-test.
ready$GDPpercapita[ready$cluster==1]
returns a vector of GDPpercapita values.
ready$GDPpercapita[ready$cluster==2]
returns a different vector of GDPpercapita values.
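You now have two vectors that each hold values of your response variable. A formula such as y ~ x expects a response on the left and a predictor on the right, with one value of each per observation, so aov fails with "variable lengths differ" when the two clusters contain different numbers of rows. When comparing the means of two groups of observed values, a two-sample test (the t-test is very common) should be used. ANOVA is overkill here; it is really meant for comparing means across more than two groups.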
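A small sketch that reproduces the error on made-up data may help. The data frame below is invented for illustration (the original data are not shown); only the names ready, GDPpercapita and cluster come from the question, and the cluster sizes 5 and 9 are arbitrary assumptions.

```r
# Synthetic stand-in for `ready`: two clusters of different sizes (values invented).
set.seed(7)
ready <- data.frame(
  cluster      = factor(rep(1:2, times = c(5, 9))),
  GDPpercapita = rnorm(14, mean = 10000, sd = 2000)
)

g1 <- ready$GDPpercapita[ready$cluster == 1]  # 5 values
g2 <- ready$GDPpercapita[ready$cluster == 2]  # 9 values

# aov() reads g1 ~ g2 as "response g1 explained by predictor g2",
# which requires both vectors to have the same length.
err <- tryCatch(aov(g1 ~ g2), error = function(e) conditionMessage(e))
print(err)

# A two-sample t-test accepts groups of unequal size.
tt <- t.test(g1, g2)
print(tt$p.value)
```

The tryCatch wrapper is only there so the script keeps running after the error and the message can be printed.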
The code
t.test(ready$GDPpercapita[ready$cluster==1], ready$GDPpercapita[ready$cluster==2])
would compare the mean GDPpercapita between clusters 1 and 2.
Since you are trying to compare the mean GDPpercapita across many groups, you could run this test for every pair of clusters (commonly referred to as pairwise t-testing). However, you'd then have to apply a multiple-comparison correction (like Bonferroni), which costs statistical power as the number of pairs grows (6 clusters already give 15 pairs).
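As a rough sketch of the pairwise approach, base R's pairwise.t.test runs every cluster-against-cluster comparison and adjusts the p-values in a single call, so no nested loop is needed. The data frame here is synthetic and only illustrative; three clusters with invented sizes and means are assumed.

```r
# Synthetic stand-in for the question's `ready` data frame (values are invented).
set.seed(42)
ready <- data.frame(
  cluster      = factor(rep(1:3, times = c(8, 12, 10))),
  GDPpercapita = c(rnorm(8, 1000, 200),
                   rnorm(12, 5000, 800),
                   rnorm(10, 20000, 3000))
)

# All pairwise t-tests between clusters, with Bonferroni-adjusted p-values.
res <- pairwise.t.test(ready$GDPpercapita, ready$cluster,
                       p.adjust.method = "bonferroni")

# Lower-triangular matrix: rows are clusters 2..k, columns are clusters 1..k-1.
print(res$p.value)
```

Other adjustment methods accepted by p.adjust (for example "holm") can be passed through p.adjust.method in the same way.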
Alternatively, ANOVA seems like a good start as long as it's used correctly.
The aov function takes a formula as its first parameter. You build the formula using ~ (think of this operator as "by") and + (this operator is like "and").
To test whether mean GDPpercapita differs between the clusters (all of them in one model), you would call:
aov(GDPpercapita ~ cluster, data = ready)
Read this as "within the dataset ready, compare mean GDPpercapita BY every value of the variable cluster".
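A minimal, self-contained sketch of that call on made-up data follows. The values are invented and three equally sized clusters are assumed; the important detail carried over from the question is that cluster is a factor, so aov treats it as a grouping variable rather than a number.

```r
# Synthetic stand-in for `ready` (values invented; `cluster` must be a factor).
set.seed(1)
ready <- data.frame(
  cluster      = factor(rep(1:3, each = 10)),
  GDPpercapita = rnorm(30, mean = rep(c(1000, 5000, 20000), each = 10), sd = 500)
)

# One-way ANOVA: does mean GDPpercapita differ between clusters?
fit <- aov(GDPpercapita ~ cluster, data = ready)

# The table has one row for `cluster` and one for the residuals.
tab <- summary(fit)[[1]]
print(tab)
```

With k clusters and n rows, the cluster row has k - 1 degrees of freedom and the residual row has n - k.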
You can add more comparisons like this:
aov(GDPpercapita ~ cluster + CurrentHE, data = ready)
Read this as "within the dataset ready, compare mean GDPpercapita BY every value of the variable cluster AND every value of the variable CurrentHE".
Note that you may have to do post-hoc testing, as ANOVA will only tell you which of the included variables (cluster, CurrentHE, etc.) seem to significantly affect the response variable. It won't give specific information like "cluster 1 is associated with a higher GDPpercapita than cluster 2". I would recommend reading up on ANOVA and t-testing, as well as how to use them in R.
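One common post-hoc option in base R is Tukey's honest significant difference test, shown here as a sketch rather than a recommendation for this particular dataset. The data are synthetic and the choice of TukeyHSD is an assumption on my part; the answer above does not name a specific post-hoc method.

```r
# Synthetic stand-in for `ready` (values invented for illustration).
set.seed(1)
ready <- data.frame(
  cluster      = factor(rep(1:3, each = 10)),
  GDPpercapita = rnorm(30, mean = rep(c(1000, 5000, 20000), each = 10), sd = 500)
)

fit <- aov(GDPpercapita ~ cluster, data = ready)

# Pairwise differences in cluster means with family-wise adjusted p-values.
tuk <- TukeyHSD(fit, "cluster")

# One row per pair (e.g. "2-1"); columns: diff, lwr, upr, p adj.
print(tuk$cluster)
```

Each row reports the estimated difference between two cluster means, its confidence interval and an adjusted p-value, which is the "cluster 1 versus cluster 2" information the overall F-test does not provide.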
Upvotes: 1