Reputation: 3911
In the following code, which uses the aggregate function, I don't understand why we have to wrap the grouping column in list(). I would rather pass the grouping column directly. I also don't understand why the same subset, train[Sales != 0], is used twice. What happens if I pass a different dataset as the second argument? It seems like an easy place to make a mistake.
aggregate(train[Sales != 0]$Sales,
by = list(train[Sales != 0]$Store), mean)
Someone might say this is a bad use of aggregate, but I also saw similar code in the R documentation:
## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to.
aggregate(state.x77, list(Region = state.region), mean)
Thanks for reading my question.
Upvotes: 0
Views: 129
Reputation: 555
First of all, if you don't like the syntax of the aggregate function, you could take a look at the dplyr package. Its syntax might be a bit easier for you.
To answer your questions: you need train[Sales != 0] twice, because otherwise the first argument and the by argument would refer to different rows. You could also create the subset first.
Base R code:
trainSales <- train[Sales != 0]
aggregate( trainSales$Sales, by = list(trainSales$Store), mean )
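As for list(): aggregate requires by to be a list of grouping vectors, each as long as the data being aggregated, which is also why both arguments must come from the same subset. The sketch below illustrates this on a small made-up data frame (the values in train are hypothetical, not the asker's data) and shows the formula interface as one way to avoid both list() and the repeated subset.

```r
# Hypothetical toy data standing in for the question's `train` table
train <- data.frame(
  Store = c(1, 1, 2, 2, 3),
  Sales = c(10, 0, 20, 40, 0)
)

# Subset once so that x and by are guaranteed to describe the same rows
train_sales <- train[train$Sales != 0, ]

# by must be a list; naming the element names the grouping column in the result
by_list <- aggregate(train_sales$Sales,
                     by = list(Store = train_sales$Store),
                     FUN = mean)

# Formula interface: no list() and no repeated subset
by_formula <- aggregate(Sales ~ Store, data = train,
                        subset = Sales != 0, FUN = mean)

print(by_list)
print(by_formula)
```

If the by vector came from a different dataset with a different number of rows, aggregate would stop with an error. If the lengths happened to match, the rows would most likely be grouped incorrectly without any warning, so taking both arguments from one object is the safer habit.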
With dplyr you could do something like this:
library(dplyr)

# Drop zero-sales rows, then average every remaining column per store
train %>%
  filter(Sales != 0) %>%
  group_by(Store) %>%
  summarise_each(funs(mean))
I use summarise_each because it condenses each group to one row, but you could of course also do something that leaves all the rows intact (in that case, use do).
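In newer dplyr releases, summarise_each and funs are deprecated. The following is a minimal sketch of the same two ideas, assuming dplyr 1.0.0 or later and a made-up train table (the Customers column and all values are hypothetical). It uses across() for the per-store summary and, as a swapped-in alternative to do, mutate() to keep every row.

```r
library(dplyr)

# Hypothetical toy data; only Store and Sales appear in the question
train <- data.frame(
  Store     = c(1, 1, 2, 2, 3),
  Sales     = c(10, 0, 20, 40, 0),
  Customers = c(5, 0, 8, 12, 0)
)

# One row per store: mean of every non-grouping column
per_store <- train %>%
  filter(Sales != 0) %>%
  group_by(Store) %>%
  summarise(across(everything(), mean))

# All filtered rows kept: the per-store mean is attached as a new column
with_mean <- train %>%
  filter(Sales != 0) %>%
  group_by(Store) %>%
  mutate(MeanSales = mean(Sales)) %>%
  ungroup()

print(per_store)
print(with_mean)
```

Because filter, group_by, and the summary all operate on the same piped table, there is no second argument that could accidentally point at different rows.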
Upvotes: 1