verystrongjoe

Reputation: 3911

aggregate function is not intuitive

In the following code, which uses the aggregate function, I can't understand why we have to use list() here. I would rather pass just the one column that the data should be grouped by. I also don't understand why the same subset, train[Sales != 0], is used twice. What happens if I pass a different dataset as the second argument? That looks like a very easy mistake to make.

aggregate(train[Sales != 0]$Sales, 
               by = list(train[Sales != 0]$Store), mean)

Some might say this is the wrong way to use it, but I also saw the following code in the R documentation:

## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to.
aggregate(state.x77, list(Region = state.region), mean)

Thanks for reading my question.

Upvotes: 0

Views: 129

Answers (1)

Jasper

Reputation: 555

First of all, if you don't like the syntax of the aggregate function, you could take a look at the dplyr package. Its syntax might be a bit easier for you.

To answer your questions:

  1. The second argument (by) is expected to be a list so that you can group by several variables at once.
  2. You have to use train[Sales != 0] twice, because otherwise the first argument and the by argument would refer to different rows. You could also make a subset first:

Base R code:

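# train[Sales != 0] relies on train being a data.table;
# with a plain data.frame, write train[train$Sales != 0, ] instead
trainSales <- train[Sales != 0]
aggregate(trainSales$Sales, by = list(trainSales$Store), mean)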
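As a rough sketch of the two points above, the example below uses a small made-up data frame (the Promo column and all values are assumptions for illustration, not the asker's real data). It shows that aggregate also has a formula interface, which lets you name the grouping column directly without list(), and that a named list element gives the result a readable column name.

```r
# Hypothetical stand-in for the 'train' data from the question
train <- data.frame(
  Store = c(1, 1, 2, 2, 3),
  Promo = c(0, 1, 0, 1, 1),
  Sales = c(120, 0, 50, 150, 0)
)

# Filter once so the values and the grouping column come from the same rows
trainSales <- train[train$Sales != 0, ]

# Formula interface: the grouping column is named directly, no list() needed
by_formula <- aggregate(Sales ~ Store, data = trainSales, FUN = mean)

# List interface: naming the element labels the grouping column in the result
by_list <- aggregate(trainSales$Sales,
                     by = list(Store = trainSales$Store),
                     FUN = mean)

# Several grouping variables: one result row per Store/Promo combination
by_two <- aggregate(Sales ~ Store + Promo, data = trainSales, FUN = mean)

print(by_formula)
print(by_list)
print(by_two)
```

Because both the value and the grouping column are taken from the same data argument, the formula version also avoids the risk of passing two different subsets by accident.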

With dplyr you could do something like this:

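library(dplyr)

# summarise_each() and funs() are deprecated in recent dplyr releases;
# summarise(across(everything(), mean)) is the current equivalent
train %>%
    filter(Sales != 0) %>%
    group_by(Store) %>%
    summarise_each(funs(mean))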

I use summarise_each here because it condenses the dataset to one row per group, but you could of course also do something that leaves all the rows intact (in that case, use do).
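If you want to stay in base R and keep every row, one option (not the dplyr approach mentioned above, just a possible alternative) is ave(), which returns the group statistic once per input row. A minimal sketch on made-up data:

```r
# Hypothetical data; the column names follow the question, the values are invented
train <- data.frame(
  Store = c(1, 1, 2, 2),
  Sales = c(120, 80, 50, 250)
)

# ave() returns a vector as long as the input: each row gets its group's mean
train$StoreMean <- ave(train$Sales, train$Store, FUN = mean)

print(train)
```

The result has the same number of rows as the input, so the per-store mean can be compared with each individual sale.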

Upvotes: 1
