Reputation: 879
I want to add a new column to my data.table. This column should contain the sum of another column of all rows that satisfy a certain condition. An example: My data.table looks like this:
require(data.table)
DT <- data.table(n=c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
t=c(10, 20, 33, 40, 50, 22, 25, 34, 11),
v=c(20, 15, 16, 17, 11, 12, 20, 22, 10)
)
DT
n t v
1: a 10 20
2: a 20 15
3: a 33 16
4: a 40 17
5: a 50 11
6: a 22 12
7: b 25 20
8: b 34 22
9: b 11 10
For every row x and every row i, where abs(t[i] - t[x]) <= 10, I want to calculate
foo = sum( v[i] * abs(t[i] - t[x]) )
In SQL I would solve this using a self join. In R I was able to do this using a for loop:
for (i in 1:nrow(DT))
DT[i, foo:=DT[n==DT[i]$n & abs(t-DT[i]$t)<=10, sum(v * abs(t-DT[i]$t) )]]
DT
n t v foo
1: a 10 20 150
2: a 20 15 224
3: a 33 16 119
4: a 40 17 222
5: a 50 11 170
6: a 22 12 30
7: b 25 20 198
8: b 34 22 180
9: b 11 10 0
Unfortunately I have to do this quite often and the table I work with is rather larger. The for-loop approach works but is too slow. I played around with the sqldf package, with no real breakthrough. I would love to do this using some data.table magic and there I need your help :-). I think what is needed is some kind of self join on the condition that the difference of the t values is smaller then the threshold.
Follow up: I have a follow up question: In my application this join is done over and over again. The v's change, but the t's and the n's are always the same. So I am thinking about somehow storing which rows belong together. Any ideas how to do this in a clever way?
Upvotes: 15
Views: 9915
Reputation: 55350
Try the following:
unique(merge(DT, DT, by="n")[abs(t.x - t.y) <= 10, list(n, sum(v.x * abs(t.x - t.y))), by=list(t.x, v.x)])
Breakdown for the above line:
You can merge a table with itself, the output will also be a data.table. Notice that the column names will be given a suffix of .x
and .y
merge(DT, DT, by="n")
... you can just filter and calculate as with any DT
# this will give you your desired rows
[abs(t.x - t.y), ]
# this is the expression you outlined
[ ... , sum(v.x * abs(t.x - t.y)) ]
# summing by t.x and v.x
[ ... , ... , by=list(t.x, v.x)]) ]
Then finally wrapping it all in unique
to remove any duplicated rows.
UPDATE: The line below is what matches your output. The only difference between this and the one at the top of this answer is the term v.y
in sum(v.y * ...)
however the by
statement still uses v.x
. Is that intentional?
unique(merge(DT, DT, by="n")[abs(t.x - t.y) <= 10, list(n, sum(v.y * abs(t.x - t.y))), by=list(t.x, v.x)])
Upvotes: 6
Reputation: 59602
Great question. This answer is just a taster really alongside Ricardo's answer.
Ideally we want to avoid the large cartesian self join for efficiency. Unfortunately range joins (FR#203) haven't been implemented yet. In the meantime, using very latest v1.8.7 (untested) :
setkey(DT,n,t)
DT[,from:=DT[.(n,t-10),which=TRUE,roll=-Inf,rollends=TRUE]]
DT[,to:=DT[.(n,t+10),which=TRUE,roll=+Inf,rollends=TRUE]]
DT[,foo:=0L]
for (i in 1:nrow(DT)) {
s = seq.int(DT$from[i],DT$to[i])
set(DT, i, "foo", DT[,sum(v[s]*abs(t[s]-t[i]))] )
}
Once FR#203 is done, the logic above would be built in, and it should become a one liner :
setkey(DT,n,t)
DT[.(n,.(t-10,t+10),t), foo:=sum(v*abs(t-i.t))]
The second column of the i
table there is a 2-column column (indicating a between join). That should be fast because, as usual, j
would be evaluated for each row of i
without needing to create a huge cartesian self join table.
That's the current thinking, anyway.
Upvotes: 12