Phann
Phann

Reputation: 1327

Group data.table by continuity of variable

I want to separate a data.table into groups based on the continuity of one variable. So to speak, from this data.table:

DT <- data.table(Var1 = c(1:5, 7:10))

I want it to be grouped like this:

#    Var1 group
# 1:    1     1 # 1 to 5 is continuous with a maximal difference of 1
# 2:    2     1
# 3:    3     1
# 4:    4     1
# 5:    5     1
# 6:    7     2 # 6 to 10 is continuous again
# 7:    8     2
# 8:    9     2
# 9:   10     2

The difference of Var1 should not be limited to one like in this minimal example, but be adjustable, so that DT <- data.table(Var1 = c(seq(1,10, 2), seq(13,30, 2))) will also be separated into two groups when given a maximal difference of 2.

EDIT: I should clarify that a 'maximal difference' of 2 or more is meant in a way that differences in Var1 smaller than two should be treated as 'continuous'. Furthermore the variable Var1 should not be limited to integer values. The last thing could be avoided by multiplying e.g. 0.14 by 100 to get 14 and also multiplying 'maximal difference' by 100.

Upvotes: 1

Views: 134

Answers (2)

Roman
Roman

Reputation: 17648

A base R solution.

foo <- function(x){
 gr <- which(!(duplicated(diff(x)) | duplicated(diff(x), fromLast = T)))
 if(length(gr) == 1){
   cbind(Var1=x,group=rep(1:(length(gr)+1), c(min(gr),length(x)-max(gr))))
 }else{
   cbind(Var1=x,group=rep(1:(length(gr)+1), c(min(gr), diff(gr),length(x)-max(gr))))
 }
}

All kind of differences are working.

foo(c(seq(1,10, 2), seq(13,30, 2)))
      Var1 group
 [1,]    1     1
 [2,]    3     1
 [3,]    5     1
 [4,]    7     1
 [5,]    9     1
 [6,]   13     2
 [7,]   15     2
 [8,]   17     2
 [9,]   19     2
[10,]   21     2
[11,]   23     2
[12,]   25     2
[13,]   27     2
[14,]   29     2

Three groups are working as well.

foo(c(1:5, 7:10, 13:20))
      Var1 group
 [1,]    1     1
 [2,]    2     1
 [3,]    3     1
 [4,]    4     1
 [5,]    5     1
 [6,]    7     2
 [7,]    8     2
 [8,]    9     2
 [9,]   10     2
[10,]   13     3
[11,]   14     3
[12,]   15     3
[13,]   16     3
[14,]   17     3
[15,]   18     3
[16,]   19     3
[17,]   20     3

For a data.table you can try:

foo <- function(x){
 gr <- which(!(duplicated(diff(x)) | duplicated(diff(x), fromLast = T)))
 if(length(gr) == 1){
   rep(1:(length(gr)+1), c(min(gr),length(x)-max(gr)))
 }else{
   rep(1:(length(gr)+1), c(min(gr), diff(gr),length(x)-max(gr)))
 }
}
DT[, group := foo(Var1)]

Upvotes: 1

Roland
Roland

Reputation: 132706

DT[, group := rleid(cumprod(c(1, diff(Var1))))]
#   Var1 group
#1:    1     1
#2:    2     1
#3:    3     1
#4:    4     1
#5:    5     1
#6:    7     2
#7:    8     2
#8:    9     2
#9:   10     2

step <- 2
DT <- data.table(Var1 = c(seq(1,10, 2), seq(13,30, 2)))
DT[, group := rleid(cumsum(c(FALSE, diff(Var1) != step)))]
#    Var1 group
# 1:    1     1
# 2:    3     1
# 3:    5     1
# 4:    7     1
# 5:    9     1
# 6:   13     2
# 7:   15     2
# 8:   17     2
# 9:   19     2
#10:   21     2
#11:   23     2
#12:   25     2
#13:   27     2
#14:   29     2

Upvotes: 3

Related Questions