Reputation: 6659
I have a very large data frame in R and would like to sum two columns for every distinct value in the other columns. For example, say we have a data frame of transactions in various shops over a day, as follows:
shop <- data.frame('shop_id' = c(1, 1, 1, 2, 3, 3),
                   'shop_name' = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'),
                   'city' = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'),
                   'sale' = c(12, 5, 9, 15, 10, 18),
                   'profit' = c(3, 1, 3, 6, 5, 9))
which is:
shop_id shop_name city sale profit
1 Shop A London 12 3
1 Shop A London 5 1
1 Shop A London 9 3
2 Shop B Cardiff 15 6
3 Shop C Dublin 10 5
3 Shop C Dublin 18 9
And I'd want to sum the sale and profit for each shop to give:
shop_id shop_name city sale profit
1 Shop A London 26 7
2 Shop B Cardiff 15 6
3 Shop C Dublin 28 14
I am currently using the following code to do this:
library(plyr)
shop_day <- ddply(shop, "shop_id", transform, sale = sum(sale), profit = sum(profit))
shop_day <- subset(shop_day, !duplicated(shop_id))
which works absolutely fine, but as I said my data frame is large (140,000 rows, 37 columns, and nearly 100,000 unique rows to sum over), and this code takes ages to run before eventually running out of memory.
Does anyone know of the most efficient way to do this?
Thanks in advance!
Upvotes: 13
Views: 35806
Reputation: 1
Sorry, my English is not very good.
I have data like this:
group X
A 2
A 1
C 1
B 5
A 2
C 1
C 2
B 5
B 5
I want a table that gives the sum of the unique values of X for each group, like this:
group X
A 3
B 5
C 3
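A minimal sketch of one way to get this, assuming the data sits in a data frame called dat with columns group and X (the name dat is my assumption, not from the post):
library(dplyr)
# keep only the distinct X values within each group, then sum them
dat %>%
  group_by(group) %>%
  summarise(X = sum(unique(X)))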
Upvotes: 0
Reputation: 483
Just in case you have a long list of columns, use summarise_if() to sum every column matching a predicate (is.numeric here, since sale and profit are doubles):
library(dplyr)
shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise_if(is.numeric, sum)
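On dplyr 1.0 and later, summarise_if() is superseded; the same idea is usually written with across(), a sketch:
library(dplyr)
shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")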
Upvotes: 2
Reputation: 6659
I think the neatest way to do this is with dplyr. Grouping by shop_name and city as well as shop_id keeps those columns in the result (they are constant within each shop, so the groups don't change):
library(dplyr)
shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise_all(sum)
Upvotes: 6
Reputation: 37824
Here's how to use base R to speed up operations like this:
# split row indices by shop, then compute vectorised sums per group
idx <- split(1:nrow(shop), shop$shop_id)
a2 <- data.frame(shop_id = sapply(idx, function(i) shop$shop_id[i[1]]),
                 sale    = sapply(idx, function(i) sum(shop$sale[i])),
                 profit  = sapply(idx, function(i) sum(shop$profit[i])))
Time reduces to 0.75 sec vs 5.70 sec for the ddply summarise version on my system.
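For plain column sums by a single grouping vector, base rowsum() is another fast option; a minimal sketch on the example data (note the group values end up as row names rather than a column):
# rowsum() adds up the numeric columns within each group in compiled code
rowsum(shop[, c("sale", "profit")], group = shop$shop_id)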
Upvotes: 4
Reputation: 43265
**Obligatory data.table answer**
> library(data.table)
data.table 1.8.0 For help type: help("data.table")
> shop.dt <- data.table(shop)
> shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id']
     shop_id sale profit
[1,]       1   26      7
[2,]       2   15      6
[3,]       3   28     14
>
Which is all well and good until things get bigger...
shop <- data.frame(shop_id = letters[1:10], profit=rnorm(1e7), sale=rnorm(1e7))
shop.dt <- data.table(shop)
> system.time(ddply(shop, .(shop_id), summarise, sale=sum(sale), profit=sum(profit)))
user system elapsed
4.156 1.324 5.514
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
user system elapsed
0.728 0.108 0.840
>
You get additional speed increases if you create the data.table with a key:
shop.dt <- data.table(shop, key='shop_id')
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
user system elapsed
0.252 0.084 0.336
>
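On current data.table, the same aggregation is often written with .SD, which scales to any number of columns; a sketch:
library(data.table)
shop.dt <- as.data.table(shop)
# sum every column listed in .SDcols within each shop_id
shop.dt[, lapply(.SD, sum), by = shop_id, .SDcols = c("sale", "profit")]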
Upvotes: 18