Reputation: 16488
Consider a data.table
that contains matches between id1
, id2
for several years.
id1 year id2
1: 51557094 2003 65122111
2: 51557094 2004 65122111
3: 51557094 2005 65122111
4: 51557094 2007 65122111
5: 51557094 2008 65122111
6: 51557093 2006 65122111
For any of these matches, I want to find out the duration, together with the year the match started. If there was no data for a specific year, a match ends (and the next year, if there is data again, a new match starts).
Hence, for the sample data above, the expected output would be
id1 year id2 length
1: 51557094 2003 65122111 3
2: 51557094 2007 65122111 2
3: 51557093 2006 65122111 1
I accepted one of the answers because it brought me far enough, but notice that it was not correct. While it worked for the sample data, it will not for the following
> dtId
id1 year id2
1: 51557094 2003 65122111
2: 51557094 2004 65122111
3: 51557094 2005 65122111
4: 51557094 2007 65122111
5: 51557094 2008 65122111
6: 51557094 2006 65122112
> setkey(dtId, id1, id2, year)
> dtId[,grp := cumsum(c(1,diff(year)) > 1),by=id1]
> dtId[,list(year=year[1],length=length(year)),by=list(id1,id2,grp)]
id1 id2 grp year length
1: 51557094 65122111 0 2003 5
2: 51557094 65122112 0 2006 1
Instead, creating the match variable grp
over both id1, id2
solves the issue:
> dtId[,grp := cumsum(c(1,diff(year)) > 1),by=list(id1, id2)]
> dtId[,list(year=year[1],length=length(year)),by=list(id1,id2,grp)]
id1 id2 grp year length
1: 51557094 65122111 0 2003 3
2: 51557094 65122112 0 2006 1
3: 51557094 65122111 1 2007 2
Upvotes: 1
Views: 77
Reputation: 93813
An attempt:
dat[,grp := cumsum(c(1,diff(year)) > 1),by=list(id1,id2)]
dat[,list(year=year[1],length=length(year)),by=list(id1,id2,grp)]
# id1 id2 grp year length
#1: 51557094 65122111 0 2003 3
#2: 51557094 65122111 1 2007 2
#3: 51557093 65122111 0 2006 1
Upvotes: 3
Reputation: 2000
First, let's build your table
library(data.table)
dtId <- data.table(
id1 = c(rep(51557094, 5), 51557093),
year = c(2000 + c(3, 4, 5, 7, 8, 6)),
id2 = rep(65122111, 6)
)
We can sort it and add a "newMatch" column that is TRUE if a new match is starting, i.e. the year hasn't increased by one compared with the previous row.
dtId <- dtId[order(id1, id2, year)]
dtId[, newMatch := c(FALSE, year[-1] != year[-.N] + 1)]
Now we can add a "match" column which value is the match ID and we can aggregate.
dtId[, match := cumsum(newMatch)]
dtAggr <- dtId[
, list(year = min(year), length = .N),
by = c("id1", "id2", "match")
]
Removing the "match" column we have the expected result.
dtAggr[, match := NULL]
dtAggr
Upvotes: 1