Reputation: 13
I am a R-beginner and I am stuck and can't find a solution. Any remarks are highly appreciated. Here is the problem:
I have a dataframe df. The columns are converted to char (Attributes) and num. I want to reduce the dataframe by using the aggregate function (dply is not an option).
When I am aggregating using
df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1")], sum)
I get correct results. But I want to group by more attributes. When adding more attributes for example
df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
then at some point, the aggegrate result changes. The sum of Amount is no longer equal to the result of the first first aggegration (or the original dataframe).
Has anyone an idea what causes this behavior.
Upvotes: 1
Views: 438
Reputation: 20085
I think @Gregor has correctly pointed out that problem could be a grouping variable having NA
. The dplyr
handles NA
in grouping variables differently than aggregate
.
We have an alternate solution with aggregate
. Please note that document suggest that
`by` a list of grouping elements, each as long as the variables in the data frame x. The elements are coerced to factors before use.
Here is clue. You can convert your grouping variables to factor
using exclude=""
which will ensure NA
are part of factor.
set.seed(1)
df <- data.frame(ATTRIBUTE1 = sample(LETTERS[1:3], 10, replace = TRUE),
ATTRIBUTE2 = sample(letters[1:3], 10, replace = TRUE),
AMOUNT = 1:10)
df$ATTRIBUTE2[5] <- NA
aggregate(df["AMOUNT"], by = list(factor(df$ATTRIBUTE1,exclude = ""),
factor(df$ATTRIBUTE2, exclude="")), sum)
# Group.1 Group.2 AMOUNT
# 1 A a 1
# 2 B a 2
# 3 B b 9
# 4 C b 10
# 5 A c 10
# 6 B c 11
# 7 C c 7
# 8 A <NA> 5
The result when grouping variables are not explicitly converted to factor
to include NA
is as:
aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
# ATTRIBUTE1 ATTRIBUTE2 AMOUNT
# 1 A a 1
# 2 B a 2
# 3 B b 9
# 4 C b 10
# 5 A c 10
# 6 B c 11
# 7 C c 7
Upvotes: 0
Reputation: 145765
My best guess is that you have missing values in some of your grouping columns. Demonstrating on the built-in mtcars
data, which has no missing values, everything is fine:
sum(mtcars$mpg)
# [1] 642.9
sum(aggregate(mtcars["mpg"], mtcars[c("am")], sum)$mpg)
# [1] 642.9
sum(aggregate(mtcars["mpg"], mtcars[c("am", "cyl")], sum)$mpg)
# [1] 642.9
But if we introduce a missing value in a grouping variable, it is not included in the aggregation:
mt = mtcars
mt$cyl[1] = NA
sum(aggregate(mt["mpg"], mt[c("am", "cyl")], sum)$mpg)
# [1] 621.9
The easiest fix would be to fill in the missing values with something other than NA
, perhaps the string "missing"
.
Upvotes: 1