Reputation: 6659
I have a very large data frame in R and would like to sum two columns for every distinct value in the other columns. For example, say we have a data frame of transactions in various shops over a day, as follows:
shop <- data.frame('shop_id' = c(1, 1, 1, 2, 3, 3),
                   'shop_name' = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'),
                   'city' = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'),
                   'sale' = c(12, 5, 9, 15, 10, 18),
                   'profit' = c(3, 1, 3, 6, 5, 9))
which is:
shop_id shop_name city sale profit
1 Shop A London 12 3
1 Shop A London 5 1
1 Shop A London 9 3
2 Shop B Cardiff 15 6
3 Shop C Dublin 10 5
3 Shop C Dublin 18 9
And I'd want to sum the sale and profit for each shop to give:
shop_id shop_name city sale profit
1 Shop A London 26 7
2 Shop B Cardiff 15 6
3 Shop C Dublin 28 14
I am currently using the following code to do this:
library(plyr)
shop_day <- ddply(shop, "shop_id", transform, sale = sum(sale), profit = sum(profit))
shop_day <- subset(shop_day, !duplicated(shop_id))
which works absolutely fine, but as I said my data frame is large (140,000 rows, 37 columns, and nearly 100,000 unique rows to sum over), and this code takes ages to run before eventually running out of memory.
Does anyone know of the most efficient way to do this?
Thanks in advance!
Upvotes: 13
Views: 35806
Reputation: 1
Sorry, my English is not very good.
I have data like this:
group X
A 2
A 1
C 1
B 5
A 2
C 1
C 2
B 5
B 5
I want a table that gives the sum of the unique values of X for each group, like this:
group X
A 3
B 5
C 3
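A minimal sketch of one way to get this, assuming the data sits in a data frame called dat with columns group and X (the name dat is my assumption, not from the post):
library(dplyr)
# keep only the distinct X values within each group, then sum them
dat %>%
  group_by(group) %>%
  summarise(X = sum(unique(X)))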
Upvotes: 0
Reputation: 483
Just in case you have a long list of columns, use summarise_if() to sum every column matching a predicate (is.numeric here, since sale and profit are doubles):
library(dplyr)
shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise_if(is.numeric, sum)
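On dplyr 1.0 and later, summarise_if() is superseded; the same idea is usually written with across(), a sketch:
library(dplyr)
shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")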
Upvotes: 2
Reputation: 6659
I think the neatest way to do this is with dplyr. Grouping by shop_name and city as well as shop_id keeps those columns in the result (they are constant within each shop, so the groups don't change):
library(dplyr)
shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise_all(sum)
Upvotes: 6
Reputation: 37824
Here's how to use base R to speed up operations like this:
# split row indices by shop, then compute vectorised sums per group
idx <- split(1:nrow(shop), shop$shop_id)
a2 <- data.frame(shop_id = sapply(idx, function(i) shop$shop_id[i[1]]),
                 sale    = sapply(idx, function(i) sum(shop$sale[i])),
                 profit  = sapply(idx, function(i) sum(shop$profit[i])))
Time reduces to 0.75 sec vs 5.70 sec for the ddply summarise version on my system.
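For plain column sums by a single grouping vector, base rowsum() is another fast option; a minimal sketch on the example data (note the group values end up as row names rather than a column):
# rowsum() adds up the numeric columns within each group in compiled code
rowsum(shop[, c("sale", "profit")], group = shop$shop_id)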
Upvotes: 4
Reputation: 43265
**Obligatory data.table answer**
> library(data.table)
data.table 1.8.0 For help type: help("data.table")
> shop.dt <- data.table(shop)
> shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id']
     shop_id sale profit
[1,]       1   26      7
[2,]       2   15      6
[3,]       3   28     14
>
Which is all well and good until things get bigger...
shop <- data.frame(shop_id = letters[1:10], profit=rnorm(1e7), sale=rnorm(1e7))
shop.dt <- data.table(shop)
> system.time(ddply(shop, .(shop_id), summarise, sale=sum(sale), profit=sum(profit)))
user system elapsed
4.156 1.324 5.514
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
user system elapsed
0.728 0.108 0.840
>
You get additional speed increases if you create the data.table with a key:
shop.dt <- data.table(shop, key='shop_id')
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
user system elapsed
0.252 0.084 0.336
>
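On current data.table, the same aggregation is often written with .SD, which scales to any number of columns; a sketch:
library(data.table)
shop.dt <- as.data.table(shop)
# sum every column listed in .SDcols within each shop_id
shop.dt[, lapply(.SD, sum), by = shop_id, .SDcols = c("sale", "profit")]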
Upvotes: 18