aggregate function is not intuitive

Question

In the following code, which uses the aggregate function, I can't understand why we have to use list() here. I would rather pass just the one column that the data should be grouped by. I also don't understand why the same subset, train[Sales != 0], appears twice. What if I passed a different dataset as the second argument? That seems like a very easy mistake to make.

aggregate(train[Sales != 0]$Sales, 
               by = list(train[Sales != 0]$Store), mean)

Some might say this is the wrong way to use it, but I also saw the same pattern in the R documentation:

## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to.
aggregate(state.x77, list(Region = state.region), mean)

Thanks for reading my question.

Jasper · Accepted Answer

First of all, if you don't like the syntax of the aggregate function, you could take a look at the dplyr package. Its syntax might be a bit easier for you.

To answer your questions:

The second argument (by) is expected to be a list so that you can group by several variables at once.
You have to use train[Sales != 0] twice, because otherwise the first argument and the by argument would refer to different rows. You could also make a subset first:

Base R code:

trainSales <- train[Sales != 0]  
aggregate( trainSales$Sales, by = list(trainSales$Store), mean )
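If repeating the subset feels error-prone, the formula interface of aggregate may be closer to what the question asks for. The sketch below is only an illustration: the toy data frame is hypothetical and stands in for train (the train[Sales != 0] syntax in the question suggests a data.table, while this example uses a plain data frame). It also shows a named list for by, and that aggregate refuses inputs whose lengths do not match.

```r
# Hypothetical toy data standing in for the 'train' table from the question.
train <- data.frame(
  Store = c(1, 1, 2, 2, 3),
  Sales = c(10, 0, 20, 40, 30)
)

# Formula interface: Sales grouped by Store. The 'subset' argument filters
# the rows once, so value and grouping columns cannot get out of step.
by_store <- aggregate(Sales ~ Store, data = train, subset = Sales != 0, FUN = mean)
print(by_store)

# Same result with 'by'; naming the list element names the output column.
nz <- train[train$Sales != 0, ]
named <- aggregate(nz$Sales, by = list(Store = nz$Store), FUN = mean)
print(named)

# Filtering only one side gives vectors of different lengths (4 vs 5 here),
# which aggregate rejects with an error instead of silently misaligning rows.
mismatch <- tryCatch(
  aggregate(train$Sales[train$Sales != 0], by = list(train$Store), FUN = mean),
  error = function(e) "error"
)
print(mismatch)
```

Note that the length check only catches subsets of different sizes. Two different subsets that happen to have the same length would still be paired row by row, which is why filtering once is the safer habit.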

With dplyr you could do something like this:

train %>%
    filter( Sales != 0) %>%
    group_by( Store ) %>%
    summarise_each( funs(mean) )

You see I use summarise_each because it condenses the dataset to one row per store, but you could of course also do something that leaves all the rows intact (in that case, use do).
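In recent dplyr releases summarise_each() and funs() are deprecated, so a current version of the same idea might look like the sketch below. The toy data frame is hypothetical and stands in for train. The second pipeline uses mutate() as one possible way to keep all rows, which is a swapped-in alternative to do rather than what the answer above used.

```r
library(dplyr)

# Hypothetical toy data standing in for the 'train' table from the question.
train <- data.frame(
  Store = c(1, 1, 2, 2, 3),
  Sales = c(10, 0, 20, 40, 30)
)

# One row per store: mean of the non-zero sales.
per_store <- train %>%
  filter(Sales != 0) %>%
  group_by(Store) %>%
  summarise(Sales = mean(Sales))
print(per_store)

# All filtered rows kept: each row carries its store's mean alongside it.
with_mean <- train %>%
  filter(Sales != 0) %>%
  group_by(Store) %>%
  mutate(StoreMean = mean(Sales)) %>%
  ungroup()
print(with_mean)
```

Because the filter is applied once at the top of the pipeline, there is no second copy of the subset that could drift out of sync.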
