FooBar
FooBar

Reputation: 16488

Number of successive rows

Consider a data.table that contains matches between id1, id2 for several years.

        id1 year      id2
1: 51557094 2003 65122111
2: 51557094 2004 65122111
3: 51557094 2005 65122111
4: 51557094 2007 65122111
5: 51557094 2008 65122111
6: 51557093 2006 65122111

For any of these matches, I want to find out the duration, together with the year the match started. If there was no data for a specific year, a match ends (and the next year, if there is data again, a new match starts).

Hence, for the sample data above, the expected output would be

        id1 year      id2 length
1: 51557094 2003 65122111      3
2: 51557094 2007 65122111      2
3: 51557093 2006 65122111      1

I accepted one of the answers because it brought me far enough, but notice that it was not correct. While it worked for the sample data, it will not for the following

> dtId
        id1 year      id2
1: 51557094 2003 65122111
2: 51557094 2004 65122111
3: 51557094 2005 65122111
4: 51557094 2007 65122111
5: 51557094 2008 65122111
6: 51557094 2006 65122112

> setkey(dtId, id1, id2, year)
> dtId[,grp := cumsum(c(1,diff(year)) > 1),by=id1]
> dtId[,list(year=year[1],length=length(year)),by=list(id1,id2,grp)]
        id1      id2 grp year length
1: 51557094 65122111   0 2003      5
2: 51557094 65122112   0 2006      1

Instead, creating the match variable grp over both id1, id2 solves the issue:

> dtId[,grp := cumsum(c(1,diff(year)) > 1),by=list(id1, id2)]
> dtId[,list(year=year[1],length=length(year)),by=list(id1,id2,grp)]
        id1      id2 grp year length
1: 51557094 65122111   0 2003      3
2: 51557094 65122112   0 2006      1
3: 51557094 65122111   1 2007      2

Upvotes: 1

Views: 77

Answers (2)

thelatemail
thelatemail

Reputation: 93813

An attempt:

dat[,grp := cumsum(c(1,diff(year)) > 1),by=list(id1,id2)]
dat[,list(year=year[1],length=length(year)),by=list(id1,id2,grp)]

#        id1      id2 grp year length
#1: 51557094 65122111   0 2003      3
#2: 51557094 65122111   1 2007      2
#3: 51557093 65122111   0 2006      1

Upvotes: 3

Michele Usuelli
Michele Usuelli

Reputation: 2000

First, let's build your table

library(data.table)
dtId <- data.table(
  id1 = c(rep(51557094, 5), 51557093),
  year = c(2000 + c(3, 4, 5, 7, 8, 6)),
  id2 = rep(65122111, 6)
  )

We can sort it and add a "newMatch" column that is TRUE if a new match is starting, i.e. the year hasn't increased by one compared with the previous row.

dtId <- dtId[order(id1, id2, year)]
dtId[, newMatch := c(FALSE, year[-1] != year[-.N] + 1)]

Now we can add a "match" column which value is the match ID and we can aggregate.

dtId[, match := cumsum(newMatch)]
dtAggr <- dtId[
  , list(year = min(year), length = .N),
  by = c("id1", "id2", "match")
  ]

Removing the "match" column we have the expected result.

dtAggr[, match := NULL]
dtAggr

Upvotes: 1

Related Questions