Reputation: 18331
I have two data tables, lets call them weights
and values
.
The weights
table has 5 columns as follows:
first POSIXct
late POSIXct
nodeid integer
aggid integer
weight numeric
The values
table has these columns
nodeid integer
Date POSIXct
hour integer
value decimal
The idea is to generate a new table where it will take the weighted average of the nodes into aggregate nodes based on the weightings. However, the weightings change through time and need to be matched based on the first and late date. SQL syntax to do this would look something like this
select v.Date, v.hour, w.aggid, sum(v.value*w.weight) as aggvalue
from values v inner join weights w
on v.nodeid=w.nodeid and v.date between w.first and w.late
group by aggid, date, hour
I'm not really sure where to start on this one given the between
logic in the SQL syntax. Is this possible in data.table syntax or would I need to transform the weights
table to have a row for every single day as opposed to using the range?
Here is some example data (sorry it's so long)...
values<-data.table(nodeid = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
6L, 6L, 6L, 6L, 6L), Date = c("2013-07-02", "2013-07-02", "2013-07-05",
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05",
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05",
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05",
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05",
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05",
"2013-07-08", "2013-07-10"), hour = c(1L, 2L, 23L, 2L, 2L, 1L,
2L, 23L, 2L, 2L, 1L, 2L, 23L, 2L, 2L, 1L, 2L, 23L, 2L, 2L, 1L,
2L, 23L, 2L, 2L, 1L, 2L, 23L, 2L, 2L), value = c(8.234, 3.218,
0.787, 8.689, 6.218, 6.89, 1.914, 2.459, 6.683, 8.122, 0.281,
1.136, 1.993, 7.27, 9.582, 5.777, 1.375, 9.204, 7.862, 0.633,
2.433, 1.842, 7.178, 10.692, 1.417, 1.259, 2.619, 0.031, 6.744,
5.941))
weights<-data.table(first = c("2013-07-01", "2013-07-01", "2013-07-01",
"2013-07-01", "2013-07-01", "2013-07-01", "2013-07-08", "2013-07-08",
"2013-07-08", "2013-07-08", "2013-07-08", "2013-07-08"), late = c("2013-07-07",
"2013-07-07", "2013-07-07", "2013-07-07", "2013-07-07", "2013-07-07",
"2013-07-20", "2013-07-20", "2013-07-20", "2013-07-20", "2013-07-20",
"2013-07-20"), nodeid = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L,
4L, 5L, 6L), aggid = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), weight = c(0.5, 0.25, 0.25, 0.3, 0.5, 0.2, 0.6, 0.2,
0.2, 0.4, 0.45, 0.15))
exresults<-data.table(aggid = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), Date = c("2013-07-02", "2013-07-02", "2013-07-02", "2013-07-02",
"2013-07-05", "2013-07-05", "2013-07-08", "2013-07-08", "2013-07-10",
"2013-07-10"), hour = c(1L, 1L, 2L, 2L, 23L, 23L, 2L, 2L, 2L,
2L), aggvalue = c(5.90975, 3.2014, 2.3715, 1.8573, 1.5065, 6.3564,
8.004, 8.9678, 7.2716, 1.782))
Upvotes: 2
Views: 350
Reputation: 49448
Use the roll
param of data.table
joins:
setkey(values, nodeid, Date)
setkey(weights, nodeid, late)
weights[values, roll = -Inf][, list(aggvalue = sum(weight*value)),
by = list(aggid, Date = late, hour)]
# aggid Date hour aggvalue
# 1: 1 2013-07-02 1 5.90975
# 2: 1 2013-07-02 2 2.37150
# 3: 1 2013-07-05 23 1.50650
# 4: 1 2013-07-08 2 8.00400
# 5: 1 2013-07-10 2 7.27160
# 6: 2 2013-07-02 1 3.20140
# 7: 2 2013-07-02 2 1.85730
# 8: 2 2013-07-05 23 6.35640
# 9: 2 2013-07-08 2 8.96780
#10: 2 2013-07-10 2 1.78200
Note: I'd be careful if the correct range doesn't exist - I didn't test that edge case.
Upvotes: 2