Reputation: 1545
I have merged two datasets. Set X contains an ID and a startDate, set Y an ID and an endDate. Both have the ID as key. I want to match every startDate with its corresponding endDate (provided it is bigger than the startDate). The problem is that IDs may appear multiple times in a given set, but not always as many times in the same set. Hence, a startDate can get matched up with multiple entries of endDate, and vice versa. This is basically the code I have:
require(data.table)
startDate = c(1,2,3,1)
IDX = c(1,2,2,3)
endDate = c(2,3,4)
IDY = c(1,1,2)
X = data.table(startDate,IDX)
Y= data.table(endDate,IDY)
setkey(X,IDX)
setkey(Y,IDY)
DT = X[Y,allow.cartesian = TRUE][endDate>startDate]
Now I would like to conditionally remove duplicated entries from the set DT. Corresponding to the keys "ID" and "startDate" I want to only keep the duplicated entry with the lowest endDate. For every duplicated entry of "ID" and "endDate" I want to do the same, but instead keep the entry that has the highest startDate.
DT looks like this:
IDX startDate endDate
1: 1 1 2
2: 1 1 3
3: 2 2 4
4: 2 3 4
There are two duplicates in this new data table. 1 and 2 are duplicates of eachother, with a different endDate. Only entry 1 has to stay (since it has the closest endDate to the startDate). 3 and 4 are also duplicates of eachother, with a different startDate. Here, entry 4 has to stay (since it has the closes startDate to the endDate). So the required output is
IDX startDate endDate
1: 1 1 2
2: 2 3 4
I don't know how to achieve this with duplicated(DT)
, since it does not necessarily give the specific duplicate that I want. Anyone got a suggestion? Moreover, can I perhaps already solve this during the "join" instead of solving it afterwards?
Upvotes: 1
Views: 272
Reputation: 49448
I don't think this can be done in a single join, but perhaps the following can work:
Y[, startDate := endDate]
setkey(X, IDX, startDate)
setkey(Y, IDY, startDate)
Y[X, roll = -Inf][, list(startDate = startDate[.N]), by = list(IDY, endDate)]
# IDY endDate startDate
#1: 1 2 1
#2: 2 4 3
Upvotes: 2