Reputation: 13807
I want to aggregate a `data.table` based on two conditions, one of which is attached to another row. Here is my problem and a reproducible example:
I have a set of origin-destination pairs. For each origin, I want to sum the points in the destinations for which `condition1` is satisfied. However, there are two tricky issues: (1) each origin-destination pair should be counted only once, even when several of its rows satisfy `condition1`, and (2) `condition2` must be satisfied in the reverse flow. That is, points in A-B can only be summed if `condition1==T` AND there is a B-A pair where `condition2==T`.
library(data.table)
dt <- data.table(
  origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
  destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
  points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5, 5, 3, 3, 4, 4, 3, 3),
  depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),
  travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10)
)
dt[ depart_time<=8 & travel_time < 60, condition1 := T] # condition 1 - trips must be in the morning and shorter than 60 min
dt[ depart_time>=16 & travel_time < 60, condition2 := T] # condition 2 - trips must be in the afternoon and shorter than 60 min
If I sum the points considering only `condition1`, this is what I get. Note that this query does not deal with the two issues: (1) it double counts points when more than one row of the same origin-destination pair satisfies `condition1`, and (2) it does not exclude the points when `condition2` is not satisfied in the reverse direction.
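dt[ condition1==T, .(points = sum(points_in_dest)), by=.(origin)]
> origin points
> 1: A 20
> 2: B 11
> 3: C 15
The desired output, which counts each origin-destination pair only once and excludes pairs whose reverse trip does not satisfy `condition2`, is:
> origin points
> 1: A 9
> 2: B 7
> 3: C 12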
My real data frame has ~80 million rows, so I would appreciate an efficient solution, likely based on `data.table`. I realize this is a tricky problem and I would appreciate any help. Thanks in advance.
This is a common problem in the time-geography of accessibility with space-time constraints. The question is how many job opportunities you can choose from given your space-time constraints and given that you live in, for example, block A. There are 5 jobs in block A, 4 jobs in block B and 3 jobs in block C, and you are qualified to work in all of them. However, you can only take a job if you can get to the office in the morning (`condition1`) AND if you can be back at home after 4 pm (`condition2`).
Upvotes: 2
Views: 493
Reputation: 92282
As you only want to count every combination once, I would suggest doing an opposite join (`destination` to `origin` and `origin` to `destination`) on the unique subsets for each of the two conditions and then simply summing the points by origin, as you are already doing.
I came across a bug in `data.table` while solving this, hence the `setattr(res, "sorted", NULL)` line (which removes the keys). This workaround won't affect performance. I've filed a bug report.
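setkey(dt, origin, destination) ## keyed so the `unique` calls below run faster
by_cols <- c("origin", "destination")
## explicit `by`: newer data.table versions make unique() compare all columns, not just the key
res <- unique(dt[(condition1)], by = by_cols)[unique(dt[(condition2)], by = by_cols),
  on = c(destination = "origin", origin = "destination"),
  nomatch = 0L]
setattr(res, "sorted", NULL) ### Fixing the bug (drops the key attribute)
res[, .(points = sum(points_in_dest)), keyby = origin]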
# origin points
# 1: A 9
# 2: B 7
# 3: C 12
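For what it's worth, here is a minimal sketch of the same reverse join that does not rely on keys at all: it reduces each condition to one row per origin-destination pair before joining. The names `morning`, `afternoon` and `valid` are my own and not from the question, and the sketch assumes `points_in_dest` is constant for a given destination, as it is in the example data.

```r
library(data.table)

# Example data copied from the question
dt <- data.table(
  origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
  destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
  points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5, 5, 3, 3, 4, 4, 3, 3),
  depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),
  travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10)
)

# One row per pair that has a morning trip under 60 min (condition 1).
# Assumption: points_in_dest does not vary within a destination.
morning <- unique(dt[depart_time <= 8 & travel_time < 60,
                     .(origin, destination, points_in_dest)])

# One row per pair that has an afternoon trip under 60 min (condition 2)
afternoon <- unique(dt[depart_time >= 16 & travel_time < 60,
                       .(origin, destination)])

# Keep a morning pair o -> d only if the reverse pair d -> o is in `afternoon`
valid <- morning[afternoon,
                 on = c(origin = "destination", destination = "origin"),
                 nomatch = 0L]

# Sum the points of the surviving pairs per origin
res <- valid[, .(points = sum(points_in_dest)), keyby = origin]
res
```

On the example data this returns 9 for A, 7 for B and 12 for C. For instance, A-C is dropped because neither afternoon C-A trip is shorter than 60 minutes, so A keeps only A-A (5) and A-B (4). Since both inputs to the join hold at most one row per pair, the join itself should stay small even when the raw table has many trips per pair.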
Upvotes: 3