R data.table update errors

Question

Some more problems I'm having with old data.table code related to this: R: number rows that match >= other row within group

The data looks like this, with a different ID for each person, IDSEQ is the sequence of each admission for this person and TAG for diabetes medication (TAG=1 hypoglycaemic agent and TAG=2 insulin).

   ID IDSEQ TAG
1   1     1   0
2   1     2   0
3   1     3   0
4   1     4   0
5   1     5   0
6   1     6   0
7   1     7   0
8   1     8   1
9   1     9   0
10  1    10   0
11  2     1   0
12  2     2   0
13  2     3   0
14  2     4   1
15  2     5   0
16  2     6   0
17  2     7   0
18  2     8   2
19  2     9   0
20  2    10   0

# recreate this data with
df <- data.frame(ID=c(rep(1,10),rep(2,10)),
                 IDSEQ=c(1:10,1:10),
                 TAG=c(rep(0,7),1,0,0,0,0,0,1,0,0,0,2,0,0))

Exercise: Create two new index sequence variables, COND1 using TAG=1 as the index record, and COND2 using TAG=2 as the index record. Write your syntax so that only record in a block prior to the index records in a block prior to the index record are numbered with a '0'.

a) TAG=1 (seems to still work)

DT <- data.table(df)
setkey(DT, ID)

# counter for condition 1
tmp <- df[which(df$TAG == 1),1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ=min(IDSEQ)), by=ID]

DT[, COND1:=0L]
DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]],COND1:=1:.N,by=ID]
# previously 
# DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]]$V1,COND1:=1:.N,by=ID]

a) TAG=2 does not result in the correct result anymore, it is not linked on both ID and IDSEQ.

tmp <- df[which(df$TAG == 2),1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ=min(IDSEQ)), by=ID]

DT[, COND2:=0L]
DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]],COND2:=1:.N,by=ID]

# previously worked with
# DT[DT[DT1,.I[IDSEQ >= i.IDSEQ]]$V1,COND2:=1:.N,by=ID]

The overall result should look like this

   ID IDSEQ TAG COND1 COND2
1   1     1   0     0     0
2   1     2   0     0     0
3   1     3   0     0     0
4   1     4   0     0     0
5   1     5   0     0     0
6   1     6   0     0     0
7   1     7   0     0     0
8   1     8   1     1     0
9   1     9   0     2     0
10  1    10   0     3     0
11  2     1   0     0     0
12  2     2   0     0     0
13  2     3   0     0     0
14  2     4   1     1     0
15  2     5   0     2     0
16  2     6   0     3     0
17  2     7   0     4     0
18  2     8   2     5     1
19  2     9   0     6     2
20  2    10   0     7     3

# recreate this data with
  data.frame(ID=c(rep(1,10),rep(2,10)),
             IDSEQ=c(1:10,1:10),
             TAG=c(rep(0,7),1,0,0,0,0,0,1,0,0,0,2,0,0),
             COND1=c(rep(0,7),1,2,3,0,0,0,1,2,3,4,5,6,7),
             COND2=c(rep(0,17),1,2,3))

data.table Version 1.9.4, R version 3.1.1

FloE · Accepted Answer

The corrected last line of your example:

DT[DT[DT1, .I[IDSEQ >= i.IDSEQ], by=.EACHI]$V1, COND2:=1:.N, by=ID]

Alternatively, you could also alter the default behavior, although I wouldn't recommend it due to compatibility issues.

options(datatable.old.bywithoutby=TRUE)

Some background information:

The selection criteria for TAG==2 results in

DT[DT1,.I[IDSEQ >= i.IDSEQ]]
[1]  8  9 10

which are the correct indices (line number) for ID==2 after the subset/join.

You would experience the same problem/result if there would be e.g. an ID==0 without any TAG==1.

df <- data.frame(ID=c(0, rep(1,10),rep(2,10)),
             IDSEQ=c(1, 1:10,1:10),
             TAG=c(0, rep(0,7),1,0,0,0,0,0,1,0,0,0,2,0,0))

DT <- data.table(df)
setkey(DT, ID)

# counter for condition 1
tmp <- df[which(df$TAG == 1),1:2]
DT1 <- data.table(tmp)
DT1 <- DT1[, list(IDSEQ=min(IDSEQ)), by=ID]

DT[, COND1:=0L]
DT[DT[DT1, .I[IDSEQ >= i.IDSEQ]], COND1:=1:.N, by=ID]

DT[c(1,2, 7:10),]
   ID IDSEQ TAG COND1
1:  0     1   0     0
2:  1     1   0     0
3:  1     6   0     0
4:  1     7   0     1
5:  1     8   1     2
6:  1     9   0     3

data.table handles this situation (by-without-by) differently since version 1.9.4. When loading data.table it states

> library(data.table)
data.table 1.9.4  For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.

Therefore you need to explicitly tell that you want to get your results j (the second part of the statement) not only for the current subset (defined in the first part), but grouped by every key variable.

More information can be found in the data.table FAQ 1.13 and 1.14 on page 5 and 6.

R data.table update errors

Answers (2)

Related Questions