MB123
MB123

Reputation: 501

row aggregation according to date in R and sum corresponding other column values divided by number of aggregated rows

Below is a part of my rather large table called "input":

        [,1]     [,2]   [,3]             
[7146,] 20100324 7.70   4.0000000
[7147,] 20100324 2.22   0.0000000
[7148,] 20100325 2.12   0.0000000
[7149,] 20100326 2.29   0.0000000
[7150,] 20100327 2.10   0.0000000
[7151,] 20100328 2.26   2.0000000
[7152,] 20100328 2.01   1.6000000
[7153,] 20100328 2.17   0.0000000
[7154,] 20100329 1.92   0.0000000
[7155,] 20100330 2.15   0.0000000

What I am trying to do is as follows:

I want to aggregate the rows that have the same date (dates are stated in column [,1]) and sum the values of these rows in columns [,2] and [,3] divided by the number of rows that are aggregated.

The output would be something like this:

        [,1]  [,2]   [,3]             
[1,] 20100324 4.96   2.0000000 # e.g: [1,2] = (input[7146,2] + input[7147,2])/2 = (7.70 
[2,] 20100325 2.12   0.0000000                + 2.22)/2 = 4.96 
[3,] 20100326 2.29   0.0000000
[4,] 20100327 2.10   0.0000000
[5,] 20100328 2.15   1.2000000
[6,] 20100329 1.92   0.0000000
[7,] 20100330 2.15   0.0000000

Help would be very much appreciated!

Upvotes: 0

Views: 4732

Answers (1)

Didzis Elferts
Didzis Elferts

Reputation: 98419

Your sampla data

df<-read.table(text="           
 20100324 7.70   4.0000000
 20100324 2.22   0.0000000
 20100325 2.12   0.0000000
 20100326 2.29   0.0000000
 20100327 2.10   0.0000000
 20100328 2.26   2.0000000
 20100328 2.01   1.6000000
 20100328 2.17   0.0000000
 20100329 1.92   0.0000000
 20100330 2.15   0.0000000")

One way is to use function ddply() and then calculate colMeans() for each column except first, that is used to split data.

library(plyr)
ddply(df,.(V1),colMeans)
        V1       V2  V3
1 20100324 4.960000 2.0
2 20100325 2.120000 0.0
3 20100326 2.290000 0.0
4 20100327 2.100000 0.0
5 20100328 2.146667 1.2
6 20100329 1.920000 0.0
7 20100330 2.150000 0.0

The same result can be achieved with aggregate().

aggregate(.~V1,data=df,mean)
        V1       V2  V3
1 20100324 4.960000 2.0
2 20100325 2.120000 0.0
3 20100326 2.290000 0.0
4 20100327 2.100000 0.0
5 20100328 2.146667 1.2
6 20100329 1.920000 0.0
7 20100330 2.150000 0.0

Third options is to use advantages of package data.table, especially if you have large data frame.

 library(data.table)
#Convert your data frame to data table and set column V1 as key.
 dt<-data.table(df,key="V1")
#Calculate mean for each column .SD means subset of your data table
 dt[,lapply(.SD,mean),by=V1]
         V1       V2  V3
1: 20100324 4.960000 2.0
2: 20100325 2.120000 0.0
3: 20100326 2.290000 0.0
4: 20100327 2.100000 0.0
5: 20100328 2.146667 1.2
6: 20100329 1.920000 0.0
7: 20100330 2.150000 0.0

Upvotes: 2

Related Questions