Reputation: 73
I'm trying to run an ANOVA test for descriptive variables with 4 different groups; the groups are defined according to the presence or absence of 2 complications.
My data
structure(list(values = c("F", "F", "M", "F", "F", "M", "F",
"F", "F", "F", "F", "F", "F", "M", "M", "F", "F", "F", "F", "M"
), ind = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Group 1",
"Group 2 ", "Group 3", "Group 4"), class = "factor")), row.names = c(NA,
20L), class = "data.frame")
I tried the code below to run the ANOVA test
anovaresult= aov(data_new$values ~ data_new$ind, data=data_new)
and I'm getting the error message below:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion
Many thanks
Please note my df was created by stacking the 4 groups together with the stack() function.
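For reference, the stacking step looked roughly like this (g1 through g4 stand in for my four group vectors; the names are just for illustration):
# Hypothetical per-group vectors, for illustration only
g1 <- c("F", "F", "M"); g2 <- c("F", "M")
g3 <- c("F", "F");      g4 <- c("M", "M")
# stack() concatenates the vectors into a `values` column and records
# each value's source list name in the factor column `ind`
data_new <- stack(list("Group 1" = g1, "Group 2" = g2,
                       "Group 3" = g3, "Group 4" = g4))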
Upvotes: 1
Views: 1337
Reputation: 17240
An ANOVA is used when you have a categorical independent variable and you want to test for differences between the means of a normally distributed continuous dependent variable. Your dependent variable is dichotomous (M/F), so ANOVA is not appropriate.
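That mismatch is exactly what your error message reflects: aov() (via lm.fit()) tries to coerce the response to numeric, and coercing "F"/"M" to numeric yields NAs. A minimal sketch of the coercion:
# aov()/lm() need a numeric response; coercing "F"/"M" produces the
# NAs behind "NA/NaN/Inf in 'y'" and "NAs introduced by coercion"
as.numeric(c("F", "M"))
# [1] NA NA
# Warning message:
# NAs introduced by coercion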
Let's say you have categorical data similar to yours, such as this:
# Data
set.seed(123)
df <- data.frame(result = sample(0:1, 100, replace = TRUE),
                 group = sample(paste("Group", 1:4), 100, replace = TRUE))
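As a quick sanity check before any formal test, you can cross-tabulate the outcome against the groups (counts not shown here):
# Contingency table of group by outcome; chisq.test() operates on
# exactly these counts
table(df$group, df$result)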
Since the outcome is drawn at random, independent of group, we would not expect any difference between the groups. We can test this with a chi-squared test of independence, a popular choice for categorical data. In R this is implemented as:
# Chi-squared test of independence
chisq.test(df$result, df$group)
# Pearson's Chi-squared test
#
# data: df$result and df$group
# X-squared = 0.18662, df = 3, p-value = 0.9797
Here the p-value is well above the conventional 0.05 threshold, so we fail to reject the null hypothesis of no difference between groups.
If the outcome were ordinal (e.g., Likert-style data), we could use a rank-based nonparametric analog, the Kruskal-Wallis test. In R this is implemented as:
kruskal.test(df$result, df$group)
# Kruskal-Wallis rank sum test
#
# data: df$result and df$group
# Kruskal-Wallis chi-squared = 0.18475, df = 3, p-value = 0.98
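As an aside, kruskal.test() also has a formula interface, which reads a little more naturally when your data live in a data frame:
# Equivalent call using the formula method
kruskal.test(result ~ group, data = df)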
You could also use logistic regression to examine the strength of the association, if any. In R this could be implemented by:
mdl <- glm(result ~ group, data = df, family = binomial(link = "logit"))
summary(mdl)
# Call:
# glm(formula = result ~ group, family = binomial(link = "logit"),
#     data = df)
#
# Deviance Residuals:
#    Min      1Q  Median      3Q     Max
# -1.128  -1.034  -1.034   1.281   1.328
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept)   -0.1178     0.4859  -0.242    0.808
# groupGroup 2  -0.2305     0.6150  -0.375    0.708
# groupGroup 3  -0.2305     0.6150  -0.375    0.708
# groupGroup 4  -0.1234     0.6312  -0.195    0.845
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 136.66  on 99  degrees of freedom
# Residual deviance: 136.48  on 96  degrees of freedom
# AIC: 144.48
#
# Number of Fisher Scoring iterations: 4
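If you want a single omnibus p-value for group from this model (the analogue of the overall ANOVA F-test), one option is a likelihood-ratio test; from the deviances above it would be a chi-squared of about 136.66 - 136.48 = 0.18 on 3 degrees of freedom, in line with the tests above:
# Likelihood-ratio test of the group term against the null model
anova(mdl, test = "Chisq")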
Note that in logistic regression you would want to exponentiate the coefficients (and their confidence limits) to obtain odds ratios (ORs). In R you could do this by:
exp(coef(mdl))
#  (Intercept) groupGroup 2 groupGroup 3 groupGroup 4
#    0.8888889    0.7941176    0.7941176    0.8839286
exp(confint(mdl))
#                  2.5 %   97.5 %
# (Intercept)  0.3337300 2.324904
# groupGroup 2 0.2352175 2.680856
# groupGroup 3 0.2352175 2.680856
# groupGroup 4 0.2537607 3.081575
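A compact way to report these is a single table of ORs alongside their confidence intervals:
# Combine point estimates and profile-likelihood CIs into one matrix
cbind(OR = exp(coef(mdl)), exp(confint(mdl)))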
As you can see, the OR confidence intervals all contain 1 (the null value of no difference), as expected.
These are just some examples of how to implement statistical tests and measures of effect for your type of data; the list is not comprehensive. Good luck!
Upvotes: 3