rafa.pereira

Reputation: 13807

Aggregate data.table based on condition in another row

I want to aggregate a data.table based on two conditions, one of which depends on a different row. Here is my problem, with a reproducible example:

I have a set of origin-destination pairs. For each origin, I want to sum the points in the destinations for which condition1 is satisfied. However, there are two tricky issues.

  1. The points in each origin-destination pair can only be summed once.
  2. The points should only be summed IF condition2 is satisfied in the reverse direction. That is, points in A-B can only be summed if condition1==T AND there is a B-A pair where condition2==T.

Reproducible example:

library(data.table)
dt <-  data.table( origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
                   destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
                   points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5,5, 3, 3, 4, 4, 3, 3),
                   depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),   
                   travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10) )

 dt[ depart_time<=8  & travel_time < 60, condition1 := T] # condition 1 - trips must be in the morning and shorter than 60 min
 dt[ depart_time>=16 & travel_time < 60, condition2 := T] # condition 2 - trips must be in the afternoon and shorter than 60 min

If I sum the points considering only condition1, this is what I get. Note that this query does not deal with the two issues: (1) it double counts points when more than one row of the same origin-destination pair satisfies condition1, and (2) it does not exclude the points when condition2 is not satisfied in the reverse direction.

dt[ condition1==T, .(points = sum(points_in_dest)), by=.(origin)]

>    origin points
> 1:      A     20
> 2:      B     11
> 3:      C     15

Desired output

>    origin points
> 1:      A      9
> 2:      B      7
> 3:      C     12
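
To make the desired output concrete, here is how the value for origin A can be checked by hand. This is only a sketch: the pairs are read off the example data above and hard-coded, and the names morning_A and can_return_to_A are made up for illustration.

```r
library(data.table)

# Unique morning pairs leaving A that satisfy condition1 (read off the example data).
morning_A <- data.table(origin = "A",
                        destination = c("A", "B", "C"),
                        points_in_dest = c(5, 4, 3))

# Places with at least one afternoon trip back to A that satisfies condition2.
# C is absent because both C-A afternoon trips take 70 minutes.
can_return_to_A <- c("A", "B")

# Keep only destinations with a feasible return trip; each pair is counted once.
points_A <- morning_A[destination %in% can_return_to_A, sum(points_in_dest)]
print(points_A)  # 9
```

A-A contributes 5 and A-B contributes 4, while A-C is dropped because there is no valid C-A return trip, which gives the 9 shown for origin A.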

My real data frame has ~80 million rows, so I would appreciate an efficient solution, likely based on data.table. I realize this is a tricky problem and I would appreciate any help. Thanks in advance.

Background

This is a common problem in the time-geography of accessibility with space-time constraints. The question is how many job opportunities you can choose from given your space-time constraints and given that you live in block A, for example. There are 5 jobs in block A, 4 jobs in block B and 3 jobs in block C, and you are qualified to work in all of them. However, you can only take a job if you can get to the office in the morning (condition1) AND if you can get back home after 4pm (condition2).

Upvotes: 2

Views: 493

Answers (1)

David Arenburg

Reputation: 92282

As you only want to count every combination once, I would suggest doing a reversed join (destination to origin and origin to destination) between the unique origin-destination pairs that satisfy each condition, and then simply summing the points by origin as you are already doing.

I came across a bug in data.table while solving this, hence the setattr(res, "sorted", NULL) line (which removes the key). This workaround won't affect performance. I've filed a bug report.

setkey(dt, origin, destination) ## doing this so the `unique` function will work faster
pair <- c("origin", "destination")
## `by` is spelled out because newer data.table versions deduplicate on all columns by default
res <- unique(dt[(condition1)], by = pair)[unique(dt[(condition2)], by = pair),
                                           on = c(destination = "origin", origin = "destination"),
                                           nomatch = 0L]
setattr(res, "sorted", NULL) ### Fixing the bug
res[, .(points = sum(points_in_dest)), keyby = origin]
#    origin points
# 1:      A      9
# 2:      B      7
# 3:      C     12
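
For comparison, the same idea can be sketched without relying on keys or on the setattr workaround. This is not the code above but a variant under stated assumptions: the helper names pair_cols, morning and afternoon are made up, the conditions are stored as plain TRUE/FALSE flags, and points_in_dest is assumed to be constant within each origin-destination pair (as it is in the example data).

```r
library(data.table)

# Example data copied from the question.
dt <- data.table(
  origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B",
             "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
  destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C",
                  "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
  points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3,
                     3, 4, 4, 5, 4, 3, 3, 3, 5, 5, 3, 3, 4, 4, 3, 3),
  depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7,
                  8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),
  travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70,
                  10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10)
)

# Plain logical flags (FALSE instead of NA) keep the subsetting explicit.
dt[, condition1 := depart_time <= 8  & travel_time < 60]
dt[, condition2 := depart_time >= 16 & travel_time < 60]

pair_cols <- c("origin", "destination")

# One row per origin-destination pair that is reachable in the morning.
morning <- unique(dt[condition1 == TRUE, .(origin, destination, points_in_dest)],
                  by = pair_cols)

# One row per pair with a feasible afternoon trip, with the columns swapped so
# each row names the outbound pair it allows (a B-A return allows A-B).
afternoon <- unique(dt[condition2 == TRUE, .(origin = destination, destination = origin)])

# Inner join keeps morning pairs that have a feasible return; then sum per origin.
res <- merge(morning, afternoon, by = pair_cols)
res <- res[, .(points = sum(points_in_dest)), keyby = origin]
print(res)
```

On the example data this gives the same totals as above (A = 9, B = 7, C = 12). Selecting only the needed columns before calling unique also keeps the intermediate tables small, which may matter at 80 million rows, though I have not benchmarked it.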

Upvotes: 3
