Reputation: 297
Let's say we have the following:
time=c(20060200,20060200,20060200,20060200,20060200,20060300,20060400,20060400,20060400)
bucket=c(1,1,2,2,1,3,3,3,1)
rate=c(0.05,0.04,0.04,0.05,0.06,0.01,0.07,0.08,0.03)
time bucket rate
1: 20060200 1 0.05
2: 20060200 1 0.04
3: 20060200 2 0.04
4: 20060200 2 0.05
5: 20060200 1 0.06
6: 20060300 3 0.01
7: 20060400 3 0.07
8: 20060400 3 0.08
9: 20060400 1 0.03
I know how to aggregate the rate by time or bucket with something like this:
test=data.table(time,bucket,rate)
b=test[,list(x=sum(rate)),by=bucket]
My question is how to aggregate by bucket while keeping the time intact. So what I want is something like this:
20060200 1 0.15
20060200 2 0.09
20060200 3 0
20060300 1 0
20060300 2 0
20060300 3 0.01
20060400 1 0.03
20060400 2 0
20060400 3 0.15
Hope this is clear, thanks.
Upvotes: 0
Views: 149
Reputation: 66819
As @Mittenchops said, you're looking for the Cartesian product. There's a function for this, CJ. You can get the combos you want with unique(CJ(time,bucket)). To use this with your data.table, you can (i) set the key and (ii) join it with the CJ:
setkey(test,time,bucket)
# by=.EACHI makes sum(rate) run once per row of the CJ table
# (required in data.table >= 1.9.4; older versions grouped implicitly)
b <- test[unique(CJ(time,bucket)),list(x=sum(rate)),by=.EACHI]
b[is.na(x),x:=0]
The last step sets missing values to 0. The result is:
time bucket x
1: 20060200 1 0.15
2: 20060200 2 0.09
3: 20060200 3 0.00
4: 20060300 1 0.00
5: 20060300 2 0.00
6: 20060300 3 0.01
7: 20060400 1 0.03
8: 20060400 2 0.00
9: 20060400 3 0.15
By the way, when you "join" using x[y,...] syntax (where x and y are both data.tables), older versions of data.table applied a hidden by -- a "by-without-by" -- on (possibly only the first part of) x's key. In current versions that grouping must be requested explicitly with by=.EACHI. Look up "by-without-by" and .EACHI in the documentation for details.
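For reference, here is a self-contained sketch of the same idea using an ad-hoc on= join (available in data.table 1.9.6+), which avoids setkey entirely; the names inside CJ(...) are just labels used for the join:

```r
library(data.table)

# The question's data
time   <- c(20060200,20060200,20060200,20060200,20060200,20060300,20060400,20060400,20060400)
bucket <- c(1,1,2,2,1,3,3,3,1)
rate   <- c(0.05,0.04,0.04,0.05,0.06,0.01,0.07,0.08,0.03)
test   <- data.table(time, bucket, rate)

# Join test onto the full time x bucket grid; by=.EACHI sums rate per grid cell,
# and unmatched cells come back NA, which we then overwrite with 0.
b <- test[CJ(time = unique(time), bucket = unique(bucket)),
          .(x = sum(rate)), by = .EACHI, on = .(time, bucket)]
b[is.na(x), x := 0]
```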
Upvotes: 5
Reputation: 19724
It sounds like the thing that makes your question difficult is less the aggregating and more creating the Cartesian product of times by buckets, to fill in the gaps the aggregate leaves. It would be great if there were a flag in the function to accomplish this, but there doesn't seem to be one.
So this isn't elegant, but here's one solution that builds that full structure first, then grafts the results of the aggregation onto that scaffolding:
df <- aggregate(rate~., data=test, sum)
> df
time bucket rate
1 20060200 1 0.15
2 20060400 1 0.03
3 20060200 2 0.09
4 20060300 3 0.01
5 20060400 3 0.15
Figure out what levels we need to build the Cartesian scaffolding, in this case all times crossed with all buckets:
> levels(factor(bucket))
[1] "1" "2" "3"
> levels(factor(time))
[1] "20060200" "20060300" "20060400"
> B <- levels(factor(bucket))
> t <- levels(factor(time))
Make a lattice base to graft the results onto:
> base <- expand.grid(B,t)
> names(base) <-c("bucket","time")
> base
bucket time
1 1 20060200
2 2 20060200
3 3 20060200
4 1 20060300
5 2 20060300
6 3 20060300
7 1 20060400
8 2 20060400
9 3 20060400
Merge the data frame onto the base; all.x=TRUE keeps every row of the scaffolding, leaving NA where there was no data:
> m <- merge(base,df,all.x=TRUE)
> m
bucket time rate
1 1 20060200 0.15
2 1 20060300 NA
3 1 20060400 0.03
4 2 20060200 0.09
5 2 20060300 NA
6 2 20060400 NA
7 3 20060200 NA
8 3 20060300 0.01
9 3 20060400 0.15
Replace NA with 0s:
m$rate[is.na(m$rate)] <- 0
> m
bucket time rate
1 1 20060200 0.15
2 1 20060300 0.00
3 1 20060400 0.03
4 2 20060200 0.09
5 2 20060300 0.00
6 2 20060400 0.00
7 3 20060200 0.00
8 3 20060300 0.01
9 3 20060400 0.15
Sort to get your desired output:
> m[with(m,order(time,bucket)),]
bucket time rate
1 1 20060200 0.15
4 2 20060200 0.09
7 3 20060200 0.00
2 1 20060300 0.00
5 2 20060300 0.00
8 3 20060300 0.01
3 1 20060400 0.03
6 2 20060400 0.00
9 3 20060400 0.15
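Put together, the whole base-R recipe above comes down to a few lines (a condensed sketch of the same steps, using the question's data):

```r
# The question's data, as a plain data.frame
time   <- c(20060200,20060200,20060200,20060200,20060200,20060300,20060400,20060400,20060400)
bucket <- c(1,1,2,2,1,3,3,3,1)
rate   <- c(0.05,0.04,0.04,0.05,0.06,0.01,0.07,0.08,0.03)
test   <- data.frame(time, bucket, rate)

df   <- aggregate(rate ~ time + bucket, data = test, sum)         # collapse duplicates
base <- expand.grid(bucket = unique(bucket), time = unique(time)) # full scaffolding
m    <- merge(base, df, all.x = TRUE)                             # graft results on
m$rate[is.na(m$rate)] <- 0                                        # fill the gaps
m    <- m[order(m$time, m$bucket), ]                              # desired ordering
```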
Upvotes: 0