Reputation: 173
I have a large data.table where I for each row need to make computations based on part of the full data.table. As an example consider the following data.table, and assume I for each row want to compute the sum of the num variable for every rows where id2 matches id1 for the current row as well as the time variable is within distance 1 from the time of the current row.
set.seed(123)
dat <- data.table(cbind(id1=sample(1:5,10,replace=T),
id2=sample(1:5,10,replace=T),
num=sample(1:10,10,replace=T),
time=sample(1:10,10,replace=T)))
This could easily be done by looping over each row like this
dat[,val:= 0]
for (i in 1:nrow(dat)){
this.val <- dat[ (id2==id1[i]) & (time>=time[i]-2) & (time<=time[i]+2),sum(num)]
dat[i,val:=this.val]
}
dat
The resulting data.table looks like this:
> dat
id1 id2 num time val
1: 2 5 9 10 6
2: 4 3 7 10 0
3: 3 4 7 7 10
4: 5 3 10 8 9
5: 5 1 7 1 2
6: 1 5 8 5 6
7: 3 2 6 8 17
8: 5 1 6 3 10
9: 3 2 3 4 0
10: 3 5 2 3 0
What is the proper/fast way to do things like this using data.table?
Upvotes: 12
Views: 316
Reputation: 886938
We can use a self-join here by creating the 'timeminus2' and 'timeplus2' column, join on
by 'id2' with 'id1' and the non-equi logical condition to get the sum
of 'num' and assign (:=
) the 'val' column to the original dataset
tmp <- dat[.(id1 = id1, timeminus2 = time - 2, timeplus2 = time + 2),
.(val = sum(num)),
on = .(id2 = id1, time >= timeminus2, time <= timeplus2),
by = .EACHI
][is.na(val), val := 0][]
dat[, val := tmp$val][]
# id1 id2 num time val
# 1: 2 5 9 10 6
# 2: 4 3 7 10 0
# 3: 3 4 7 7 10
# 4: 5 3 10 8 9
# 5: 5 1 7 1 2
# 6: 1 5 8 5 6
# 7: 3 2 6 8 17
# 8: 5 1 6 3 10
# 9: 3 2 3 4 0
#10: 3 5 2 3 0
Upvotes: 5