Reputation: 501
Below is a part of my rather large table called "input":
[,1] [,2] [,3]
[7146,] 20100324 7.70 4.0000000
[7147,] 20100324 2.22 0.0000000
[7148,] 20100325 2.12 0.0000000
[7149,] 20100326 2.29 0.0000000
[7150,] 20100327 2.10 0.0000000
[7151,] 20100328 2.26 2.0000000
[7152,] 20100328 2.01 1.6000000
[7153,] 20100328 2.17 0.0000000
[7154,] 20100329 1.92 0.0000000
[7155,] 20100330 2.15 0.0000000
What I am trying to do is as follows:
I want to aggregate the rows that have the same date (dates are stated in column [,1]) and sum the values of these rows in columns [,2] and [,3] divided by the number of rows that are aggregated.
The output would be something like this:
[,1] [,2] [,3]
[1,] 20100324 4.96 2.0000000 # e.g: [1,2] = (input[7146,2] + input[7147,2])/2 = (7.70
[2,] 20100325 2.12 0.0000000 + 2.22)/2 = 4.96
[3,] 20100326 2.29 0.0000000
[4,] 20100327 2.10 0.0000000
[5,] 20100328 2.15 1.2000000
[6,] 20100329 1.92 0.0000000
[7,] 20100330 2.15 0.0000000
Help would be very much appreciated!
Upvotes: 0
Views: 4732
Reputation: 98419
Your sampla data
df<-read.table(text="
20100324 7.70 4.0000000
20100324 2.22 0.0000000
20100325 2.12 0.0000000
20100326 2.29 0.0000000
20100327 2.10 0.0000000
20100328 2.26 2.0000000
20100328 2.01 1.6000000
20100328 2.17 0.0000000
20100329 1.92 0.0000000
20100330 2.15 0.0000000")
One way is to use function ddply()
and then calculate colMeans()
for each column except first, that is used to split data.
library(plyr)
ddply(df,.(V1),colMeans)
V1 V2 V3
1 20100324 4.960000 2.0
2 20100325 2.120000 0.0
3 20100326 2.290000 0.0
4 20100327 2.100000 0.0
5 20100328 2.146667 1.2
6 20100329 1.920000 0.0
7 20100330 2.150000 0.0
The same result can be achieved with aggregate()
.
aggregate(.~V1,data=df,mean)
V1 V2 V3
1 20100324 4.960000 2.0
2 20100325 2.120000 0.0
3 20100326 2.290000 0.0
4 20100327 2.100000 0.0
5 20100328 2.146667 1.2
6 20100329 1.920000 0.0
7 20100330 2.150000 0.0
Third options is to use advantages of package data.table
, especially if you have large data frame.
library(data.table)
#Convert your data frame to data table and set column V1 as key.
dt<-data.table(df,key="V1")
#Calculate mean for each column .SD means subset of your data table
dt[,lapply(.SD,mean),by=V1]
V1 V2 V3
1: 20100324 4.960000 2.0
2: 20100325 2.120000 0.0
3: 20100326 2.290000 0.0
4: 20100327 2.100000 0.0
5: 20100328 2.146667 1.2
6: 20100329 1.920000 0.0
7: 20100330 2.150000 0.0
Upvotes: 2