Why do mean() and mean(aggregate()) return different results?

Question

I want to calculate a mean. Here is the code with sample data:

# sample data
Nr <- c(1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
dph <- c(3.125000, 6.694737, 4.310680, 11.693735, 103.882353, 11.000000, 7.333333, 20.352941, 5.230769, NA, 4.615385, 47.555556, 2.941176, 18.956522, 44.320000, 28.500000, NA, 10.470588, 19.000000, 25.818182, 43.216783, 51.555556, 8.375000, 6.917647, 9.375000, 5.647059, 4.533333, 27.428571, 14.428571, NA, 1.600000, 5.764706, 4.705882, 55.272727, 2.117647, 30.888889, 41.222222, 23.444444, 2.428571, 6.200000, 17.076923, 21.280000, 40.829268, 14.500000, 6.250000, NA, 15.040000, 5.687204, 2.400000, NA, 26.375000, 18.064516, 4.000000, 6.139535, 8.470588, 128.666667, 2.235294, 34.181818, 116.000000, 6.000000, 5.777778, 10.666667, 15.428571, 54.823529, 81.315789, 42.333333)
dat <- data.frame(Nr = Nr, dph = dph)

# calculate mean directly
mean(dat$dph, na.rm = TRUE)
# [1] 23.02403

# aggregate first, then calculate mean
mean(aggregate(dph ~ Nr, dat, mean, na.rm = TRUE)$dph)
# [1] 22.11743

# 23.02403 != 22.11743

Why do I get two different results?

Explanation for question:

I need to perform a Wilcoxon test, comparing a pre baseline with a post baseline. Pre consists of 3 measurements per patient, post of 16. Because a Wilcoxon test needs two vectors of equal length, I calculate the pre and post means for each patient with aggregate, creating two vectors of equal length. The data above are the pre measurements.

Edit:

Patient no. 4 was removed from the data. But using Nr <- rep(1:22, 3) returns the same results.

talat · Accepted Answer

I think this is because of how the NAs are handled. In the mean(dat$dph, na.rm = TRUE) version, each NA that is removed reduces the number of observations by 1, so every remaining value gets the same weight. If you aggregate first, the NA in row 10 (ID 11), for example, is also removed, but since the other rows with ID 11 are not NA, that ID still yields a mean (now based on 2 values instead of 3), and the number of IDs you average over afterwards is not reduced.

So the difference, IMO, comes from the different divisors: the direct version divides the total by the number of non-NA observations, while the aggregated version gives every ID the same weight, regardless of how many non-NA values it contributed.
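
A minimal sketch of that weighting effect, using made-up toy data rather than the data from the question (the names toy, direct, per_id and by_id are only illustrative):

```r
# Toy data: two IDs with three measurements each; ID 1 has one NA
toy <- data.frame(
  Nr  = c(1, 1, 1, 2, 2, 2),
  dph = c(10, NA, 20, 1, 2, 3)
)

# Direct mean: the NA is dropped, leaving 5 values that each count 1/5
direct <- mean(toy$dph, na.rm = TRUE)     # (10 + 20 + 1 + 2 + 3) / 5 = 7.2

# Mean of per-ID means: the formula interface drops the NA row,
# but ID 1 is still there and counts as much as ID 2
per_id <- aggregate(dph ~ Nr, toy, mean)  # ID 1: 15, ID 2: 2
by_id  <- mean(per_id$dph)                # (15 + 2) / 2 = 8.5
```

Here the two remaining values of ID 1 each carry a weight of 1/4 in by_id, but only 1/5 in direct, which is why the two numbers differ.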

You can verify this by changing the NA entries to 0 and then calculating the mean again with both versions; they will return the same value.
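
That check could look roughly like this. It is only a sketch on made-up toy data (toy0, direct0 and by_id0 are illustrative names), and setting NAs to 0 is meant purely as a diagnostic here, not as a way of treating missing values:

```r
# Toy data: two IDs with three measurements each; ID 1 has one NA
toy0 <- data.frame(
  Nr  = c(1, 1, 1, 2, 2, 2),
  dph = c(10, NA, 20, 1, 2, 3)
)

# Replace the NA by 0 so that no observation is dropped any more
toy0$dph[is.na(toy0$dph)] <- 0

direct0 <- mean(toy0$dph)                             # 36 / 6 = 6
by_id0  <- mean(aggregate(dph ~ Nr, toy0, mean)$dph)  # (10 + 2) / 2 = 6

all.equal(direct0, by_id0)                            # TRUE
```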

But note that, in general, the two versions only agree in that check because you have the same number of observations for each ID (3 in this case). If the numbers differed between IDs, you would again get different results.
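
To illustrate the unequal case, here is a small sketch with hypothetical unbalanced data and no NAs at all (unbal, n_per_id and reweighted are illustrative names). As a side note, it also shows that weighting each ID mean by its number of observations, here with weighted.mean(), gives back the direct mean:

```r
# Hypothetical unbalanced data: ID 1 has 1 measurement, ID 2 has 3
unbal <- data.frame(
  Nr  = c(1, 2, 2, 2),
  dph = c(10, 1, 2, 3)
)

direct <- mean(unbal$dph)                    # 16 / 4 = 4
per_id <- aggregate(dph ~ Nr, unbal, mean)   # ID 1: 10, ID 2: 2
by_id  <- mean(per_id$dph)                   # (10 + 2) / 2 = 6

# Number of observations per ID, in the same (sorted) ID order as per_id
n_per_id <- aggregate(dph ~ Nr, unbal, length)$dph   # 1, 3

# Weighting the ID means by those counts recovers the direct mean
reweighted <- weighted.mean(per_id$dph, n_per_id)    # (10 * 1 + 2 * 3) / 4 = 4
```

Which of the two is appropriate depends on whether each measurement or each patient should count equally; for the per-patient vectors described in the question, the unweighted per-ID means are what aggregate already returns.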
