Reputation: 5
I'm running a 2-way ANOVA to compare how two variables affect the price of a service. However, when I run post-hoc pairwise comparisons I would like to only see rows that are comparable (i.e. when one of the independent variables matches). Here is an example to show the issue:
Variable 1 - Species: tuna, halibut, salmon, flounder
Variable 2 - Sex: Male, female
Dependent variable: mass
Here is what I'm using for 2-way ANOVA and pairwise comparisons:
> dat <- read.csv(<file location>)
> aov2 <- aov(mass ~ species * sex, data = dat)
> TukeyHSD(aov2, which = "species:sex")
When I run the pairwise comparisons of mass using TukeyHSD, I would only like to see ones such as Tuna:Male - Tuna:Female or Halibut:Male - Flounder:Male, but I would not like to see Flounder:Female - Halibut:Male. Basically, I just want to see comparisons where one of the variables has the same value.
Honestly am not even sure what to try, or if this is even possible. I had some NA values for some of the comparisons and I tried using:
TukeyHSD(aov2, which = "species:sex") %>% filter(is.na(diff))
but I get the following:
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "c('TukeyHSD', 'multicomp')"
If this is not possible, is there a way to filter data from a dataframe before it goes into 1-way ANOVA, for example by a certain value in one of the columns? Then I can try to run multiple 1-way ANOVAs to accomplish a similar goal. I would rather not have to create separate data files because the actual dataset I'm working with is very large and would be time-consuming to manually separate.
Upvotes: 0
Views: 274
Reputation: 19349
This is not possible with the current TukeyHSD
function. However, with some modifications of this function, you can get the desired output. Below is the modified function, which adds a new argument called "match", the default being FALSE:
TukeyHSD2 <- function (x, which = seq_along(tabs), ordered = FALSE, conf.level = 0.95,
match=FALSE, ...)
{
mm <- model.tables(x, "means")
if (is.null(mm$n))
stop("no factors in the fitted model")
tabs <- mm$tables
if (names(tabs)[1L] == "Grand mean")
tabs <- tabs[-1L]
tabs <- tabs[which]
nn <- mm$n[names(tabs)]
nn_na <- is.na(nn)
if (all(nn_na))
stop("'which' specified no factors")
if (any(nn_na)) {
warning("'which' specified some non-factors which will be dropped")
tabs <- tabs[!nn_na]
nn <- nn[!nn_na]
}
out <- setNames(vector("list", length(tabs)), names(tabs))
MSE <- sum(x$residuals^2)/x$df.residual
for (nm in names(tabs)) {
tab <- tabs[[nm]]
means <- as.vector(tab)
nms <- if (length(dim(tab)) > 1L) {
dn <- dimnames(tab)
apply(do.call("expand.grid", dn), 1L, paste, collapse = ":")
}
else names(tab)
n <- nn[[nm]]
if (length(n) < length(means))
n <- rep.int(n, length(means))
if (as.logical(ordered)) {
ord <- order(means)
means <- means[ord]
n <- n[ord]
if (!is.null(nms))
nms <- nms[ord]
}
center <- outer(means, means, `-`)
keep <- lower.tri(center)
dnames <- list(outer(nms, nms, paste, sep = "-"), c("diff", "lwr", "upr", "p adj"))
dk1 <- dnames[[1L]][keep]
if(match) {
keep2 <- sapply(1:length(dk), function(x) {
pattern <- '(.+):(.+)-(.+):(.+)'
d1 <- sub(pattern, '\\1', dk1[x])
d2 <- sub(pattern, '\\2', dk1[x])
d3 <- sub(pattern, '\\3', dk1[x])
d4 <- sub(pattern, '\\4', dk1[x])
d1==d3 | d2==d4 } )
dnames[[1L]] <- dk1[keep2]
} else dnames[[1L]] <- dk1
center <- center[keep]
width <- qtukey(conf.level, length(means), x$df.residual) *
sqrt((MSE/2) * outer(1/n, 1/n, `+`))[keep]
est <- center/(sqrt((MSE/2) * outer(1/n, 1/n, `+`))[keep])
if(match) {
center <- center[keep2]
width <- width[keep2]
est <- est[keep2]
}
pvals <- ptukey(abs(est), length(means), x$df.residual,
lower.tail = FALSE)
out[[nm]] <- array(c(center, center - width, center +
width, pvals), c(length(width), 4L), dnames)
}
class(out) <- c("TukeyHSD", "multicomp")
attr(out, "orig.call") <- x$call
attr(out, "conf.level") <- conf.level
attr(out, "ordered") <- ordered
out
}
Send the above function to your R console and then test it on your data. Here is the output using the "warpbreaks" example dataset.
fm1 <- aov(breaks ~ wool * tension, data = warpbreaks)
> TukeyHSD(fm1, which="wool:tension")
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = breaks ~ wool * tension, data = warpbreaks)
$`wool:tension`
diff lwr upr p adj
B:L-A:L -16.3333333 -31.63966 -1.027012 0.0302143
A:M-A:L -20.5555556 -35.86188 -5.249234 0.0029580
B:M-A:L -15.7777778 -31.08410 -0.471456 0.0398172
A:H-A:L -20.0000000 -35.30632 -4.693678 0.0040955
B:H-A:L -25.7777778 -41.08410 -10.471456 0.0001136
A:M-B:L -4.2222222 -19.52854 11.084100 0.9626541
B:M-B:L 0.5555556 -14.75077 15.861877 0.9999978
A:H-B:L -3.6666667 -18.97299 11.639655 0.9797123
B:H-B:L -9.4444444 -24.75077 5.861877 0.4560950
B:M-A:M 4.7777778 -10.52854 20.084100 0.9377205
A:H-A:M 0.5555556 -14.75077 15.861877 0.9999978
B:H-A:M -5.2222222 -20.52854 10.084100 0.9114780
A:H-B:M -4.2222222 -19.52854 11.084100 0.9626541
B:H-B:M -10.0000000 -25.30632 5.306322 0.3918767
B:H-A:H -5.7777778 -21.08410 9.528544 0.8705572
And here is the output from the modified version with "match" set to TRUE.
> TukeyHSD2(fm1, which="wool:tension", match=TRUE)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = breaks ~ wool * tension, data = warpbreaks)
$`wool:tension`
diff lwr upr p adj
B:L-A:L -16.3333333 -31.63966 -1.027012 0.0302143
A:M-A:L -20.5555556 -35.86188 -5.249234 0.0029580
A:H-A:L -20.0000000 -35.30632 -4.693678 0.0040955
B:M-B:L 0.5555556 -14.75077 15.861877 0.9999978
B:H-B:L -9.4444444 -24.75077 5.861877 0.4560950
B:M-A:M 4.7777778 -10.52854 20.084100 0.9377205
A:H-A:M 0.5555556 -14.75077 15.861877 0.9999978
B:H-B:M -10.0000000 -25.30632 5.306322 0.3918767
B:H-A:H -5.7777778 -21.08410 9.528544 0.8705572
Upvotes: 0