R - Aggregate Function different Results When Adding new grouping column

Question

I am a R-beginner and I am stuck and can't find a solution. Any remarks are highly appreciated. Here is the problem:

I have a dataframe df. The columns are converted to char (Attributes) and num. I want to reduce the dataframe by using the aggregate function (dply is not an option).

When I am aggregating using

df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1")], sum)

I get correct results. But I want to group by more attributes. When adding more attributes for example

df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)

then at some point, the aggegrate result changes. The sum of Amount is no longer equal to the result of the first first aggegration (or the original dataframe).

Has anyone an idea what causes this behavior.

MKR · Accepted Answer

I think @Gregor has correctly pointed out that problem could be a grouping variable having NA. The dplyr handles NA in grouping variables differently than aggregate.

We have an alternate solution with aggregate. Please note that document suggest that

`by`   a list of grouping elements, each as long as the variables in the data
       frame x. The elements are coerced to factors before use.

Here is clue. You can convert your grouping variables to factor using exclude="" which will ensure NA are part of factor.

set.seed(1)

df <- data.frame(ATTRIBUTE1 = sample(LETTERS[1:3], 10, replace = TRUE),
                 ATTRIBUTE2 = sample(letters[1:3], 10, replace = TRUE),
                 AMOUNT = 1:10)

df$ATTRIBUTE2[5] <- NA


aggregate(df["AMOUNT"], by = list(factor(df$ATTRIBUTE1,exclude = ""),
                                  factor(df$ATTRIBUTE2, exclude="")), sum)

#   Group.1 Group.2 AMOUNT
# 1       A       a      1
# 2       B       a      2
# 3       B       b      9
# 4       C       b     10
# 5       A       c     10
# 6       B       c     11
# 7       C       c      7
# 8       A          5

The result when grouping variables are not explicitly converted to factor to include NA is as:

aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
#   ATTRIBUTE1 ATTRIBUTE2 AMOUNT
# 1          A          a      1
# 2          B          a      2
# 3          B          b      9
# 4          C          b     10
# 5          A          c     10
# 6          B          c     11
# 7          C          c      7

R - Aggregate Function different Results When Adding new grouping column

Answers (2)

Related Questions