Reputation: 23
I'm trying to calculate the mean of some grouped data, but I'm running into an issue where the mean generated using base::mean()
is generating a different value than when I use base:rowMeans()
or try to replicate the mean in Excel.
Here's the code with a simplified data frame looking at just a small piece of the data:
df <- data.frame("ID" = 1101372,
"Q1" = 5.996667,
"Q2" = 6.005556,
"Q3" = 5.763333)
avg1 <- df %>%
summarise(new_avg = mean(Q1,
Q2,
Q3)) # Returns a value of 5.99667
avg2 <- rowMeans(df[,2:4]) # Returns a value of 5.921852
The value in avg2
is what I get when I use AVERAGE in Excel, but I can't figure out why mean()
is not generating the same number.
Any thoughts?
Upvotes: 2
Views: 224
Reputation: 887128
Here, the mean
is taking only the first argument i.e. Q1 as 'x' because the usage for ?mean
is
mean(x, trim = 0, na.rm = FALSE, ...)
i.e. the second and third argument are different. In the OP's code, x will be taken as "Q1", trim
as "Q2" and so on.. The ...
at the end also means that the user can supply n
number of parameters without any error and leads to confusions like this (if we don't check the usage)
We can specify the data as .
, subset the columns of interest and use that in rowMeans
df %>%
summarise(new_avg = rowMeans(.[-1]))
This would be more efficient. But, if we want to use mean
as such, then do a rowwise
df %>%
rowwise() %>%
summarise(new_avg = mean(c(Q1, Q2, Q3)))
# A tibble: 1 x 1
# new_avg
# <dbl>
#1 5.92
Or convert to 'long' format and then do the group_by
'ID' and get the mean
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>% # can skip this step if there is only a single row
summarise(new_avg = mean(value))
# A tibble: 1 x 2
# ID new_avg
# <dbl> <dbl>
#1 1101372 5.92
Upvotes: 1