Dean MacGregor
Dean MacGregor

Reputation: 18331

data.table join using two columns from one table and one column from other

I have two data tables, lets call them weights and values.
The weights table has 5 columns as follows:

first POSIXct
late POSIXct
nodeid integer
aggid integer
weight numeric

The values table has these columns

nodeid integer
Date POSIXct
hour integer
value decimal

The idea is to generate a new table where it will take the weighted average of the nodes into aggregate nodes based on the weightings. However, the weightings change through time and need to be matched based on the first and late date. SQL syntax to do this would look something like this

select v.Date, v.hour, w.aggid, sum(v.value*w.weight) as aggvalue
from values v inner join weights w
on v.nodeid=w.nodeid and v.date between w.first and w.late
group by aggid, date, hour

I'm not really sure where to start on this one given the between logic in the SQL syntax. Is this possible in data.table syntax or would I need to transform the weights table to have a row for every single day as opposed to using the range?

Here is some example data (sorry it's so long)...

values<-data.table(nodeid = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 
6L, 6L, 6L, 6L, 6L), Date = c("2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10"), hour = c(1L, 2L, 23L, 2L, 2L, 1L, 
2L, 23L, 2L, 2L, 1L, 2L, 23L, 2L, 2L, 1L, 2L, 23L, 2L, 2L, 1L, 
2L, 23L, 2L, 2L, 1L, 2L, 23L, 2L, 2L), value = c(8.234, 3.218, 
0.787, 8.689, 6.218, 6.89, 1.914, 2.459, 6.683, 8.122, 0.281, 
1.136, 1.993, 7.27, 9.582, 5.777, 1.375, 9.204, 7.862, 0.633, 
2.433, 1.842, 7.178, 10.692, 1.417, 1.259, 2.619, 0.031, 6.744, 
5.941))

weights<-data.table(first = c("2013-07-01", "2013-07-01", "2013-07-01", 
"2013-07-01", "2013-07-01", "2013-07-01", "2013-07-08", "2013-07-08", 
"2013-07-08", "2013-07-08", "2013-07-08", "2013-07-08"), late = c("2013-07-07", 
"2013-07-07", "2013-07-07", "2013-07-07", "2013-07-07", "2013-07-07", 
"2013-07-20", "2013-07-20", "2013-07-20", "2013-07-20", "2013-07-20", 
"2013-07-20"), nodeid = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 
4L, 5L, 6L), aggid = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 
2L, 2L), weight = c(0.5, 0.25, 0.25, 0.3, 0.5, 0.2, 0.6, 0.2, 
0.2, 0.4, 0.45, 0.15))

exresults<-data.table(aggid = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L), Date = c("2013-07-02", "2013-07-02", "2013-07-02", "2013-07-02", 
"2013-07-05", "2013-07-05", "2013-07-08", "2013-07-08", "2013-07-10", 
"2013-07-10"), hour = c(1L, 1L, 2L, 2L, 23L, 23L, 2L, 2L, 2L, 
2L), aggvalue = c(5.90975, 3.2014, 2.3715, 1.8573, 1.5065, 6.3564, 
8.004, 8.9678, 7.2716, 1.782))

Upvotes: 2

Views: 350

Answers (1)

eddi
eddi

Reputation: 49448

Use the roll param of data.table joins:

setkey(values, nodeid, Date)
setkey(weights, nodeid, late)

weights[values, roll = -Inf][, list(aggvalue = sum(weight*value)),
                               by = list(aggid, Date = late, hour)]
#    aggid       Date hour aggvalue
# 1:     1 2013-07-02    1  5.90975
# 2:     1 2013-07-02    2  2.37150
# 3:     1 2013-07-05   23  1.50650
# 4:     1 2013-07-08    2  8.00400
# 5:     1 2013-07-10    2  7.27160
# 6:     2 2013-07-02    1  3.20140
# 7:     2 2013-07-02    2  1.85730
# 8:     2 2013-07-05   23  6.35640
# 9:     2 2013-07-08    2  8.96780
#10:     2 2013-07-10    2  1.78200

Note: I'd be careful if the correct range doesn't exist - I didn't test that edge case.

Upvotes: 2

Related Questions