Dnaiel
Dnaiel

Reputation: 7832

Collapse rows in a data frame using R

I have a data frame in R defined as follows:

data frame:

col 1  col 2  col 3 col4 
200    AIG   8.5   12   
800    AIG   8.1   20.1   
500    A1B   20   50.5   
800    A1B   12   30   
120    A2M   1.6   8.5   

dat <- structure(list(col1 = c(200, 800, 500, 800, 120), col2 = structure(c(3L, 
    3L, 1L, 1L, 2L), .Label = c("A1B", "A2M", "AIG"), class = "factor"), 
        col3 = c(8.5, 8.1, 20, 12, 1.6), col4 = c(12, 20.1, 50.5, 
        30, 8.5)), .Names = c("col1", "col2", "col3", "col4"), row.names = c(NA, 
    -5L), class = "data.frame")

Then I'd like to collapse the rows by id (in this case the unique ids are A1G, A1B, A2M).
Col 1, I'd like to collapse it by adding the rows with the same id.
Col 2, I'd like to collapse it to each unique id
Col 3, I'd like to collapse it as follows, take col1*col3, add them, and then divide them by the sum of col1.
I.e., the A1G new row value should be (8.5*20+8.1*80)/(80+20). Aka the weighted average of column 3 weighted by the values of col1.
Col 4, I'd like to take the maximum value.

The resulting data frame should look like:

column 1  column 2  column 3 column 4 
800+200=1000    AIG   (8.5*200+8.1*800)/1000=8.18   max(12,20.1)=20.1   
800+500=1300    AIB   (20*800+12*500)/1300=16.9   max(50.5, 30)=50.5   
120    A2M   1.6   8.5   

Any suggestions?

Upvotes: 13

Views: 24386

Answers (4)

Tyler Rinker
Tyler Rinker

Reputation: 109844

A base solution but I like the data.table solution:

dat[, 2] <- factor(dat[, 2], levels=unique(dat[, 2])) #in case not already ordered
L1 <- split(dat, dat$col2)                            #split into list by col2

funny <- function(x){                                 #function to calculate stuff
    x <- data.frame(x)
    c(col1=sum(x[,1]), col2=as.character(x[1, 2]), 
        col3=sum((x[, 3]*x[, 1]))/sum(x[, 1]),
        col4=max(x[,4]))
}

#apply function and wrap it up into dataframe
dat2 <- data.frame(do.call(rbind, lapply(L1, funny)), row.names=NULL) 
dat2[, -2] <- apply(dat2[, -2], 2, as.numeric)       #reapply classes    
dat2

#> dat2
#  col1 col2     col3 col4
#1 1000  AIG  8.18000 20.1
#2 1300  A1B 15.07692 50.5
#3  120  A2M  1.60000  8.5

Upvotes: 1

nograpes
nograpes

Reputation: 18323

A solution in base:

dat2<-do.call(rbind,
  by(dat,dat$col2, function(x) 
    with (x,
     data.frame(
       col1 = sum(col1),
       col3 = sum(col1 * col3) / sum(col1),
       col4 = max(col4)
     )
    )
  )
)
dat2$col2<-rownames(dat2)

#     col1     col3 col4 col2
# A1B 1300 15.07692 50.5  A1B
# A2M  120  1.60000  8.5  A2M
# AIG 1000  8.18000 20.1  AIG

Upvotes: 10

GSee
GSee

Reputation: 49810

Here is a data.table solution that will scale well for big data (speed and memory efficient)

library(data.table)
DT <- data.table(dat, key="col2")
DT[, list(col1=sum(col1), 
          col3=sum(col1 * col3) / sum(col1), 
          col4=max(col4)), by=col2]
#   col2 col1     col3 col4
#1:  A1B 1300 15.07692 50.5
#2:  A2M  120  1.60000  8.5
#3:  AIG 1000  8.18000 20.1

Upvotes: 20

flodel
flodel

Reputation: 89057

Using the plyr package:

library(plyr)
ddply(df, "col2", summarize, col1 = sum(col1),
                             col3 = sum(col1 * col3) / sum(col1),
                             col4 = max(col4))
#   col2 col1     col3 col4
# 1  A1B 1300 15.07692 50.5
# 2  A2M  120  1.60000  8.5
# 3  AIG 1000  8.18000 20.1

Upvotes: 6

Related Questions