user2363642
user2363642

Reputation: 757

subscript out of bounds error datatable

I have a simple data frame as follows:

 hai_dispense_number hai_atc date_of_claim hai_age
1                  tom    A10A    2011-11-01      42
2                  tom    A10B    2011-11-01      42
3                  tom    G10R    2011-11-01      42
4                  tom    A10A    2012-02-02      42
5                  tom    A10A    2012-04-03      42
6                 mary    A10A    2012-02-02      36
7                 mary    A10A    2012-03-02      36
8                 mary    G123    2012-03-02      36
9                 mary    E123    2012-05-02      36
10                mary    T123    2012-07-02      36
11                mary    A10A    2012-08-02      43
12                 sue   GR123    2012-03-02      43
13                 sue   GR123    2012-03-08      43
14                 sue   GR123    2012-09-03      43
15                 sue   GR123    2012-10-01      43
16                 sue   GR123    2012-03-02      43
17                 pat    A10A    2012-01-02      52
18                 pat   GR123    2011-11-12      52
19                 pat    A10A    2012-03-03      52
20                 pat   GR123    2012-01-01      52
21                 pat    A10A    2012-05-06      52

I want to isolate people who had the code A10A only from 2012-01-01 onwards, and not in 2011. There are some people in the above dataframe who did not have the code A10A at any stage and so I'd like to get rid of them too. So in the end, I want to have data only for pat and mary. Here is how I do it:

Step 1: Assigns the "index date". In other words the first date in 2012 that someone got the code A10A

dt2<- data.table(dff,key=c('hai_dispense_number','date_of_claim'))
dt2[,date_of_claim := as.Date(date_of_claim)]
dt2[grepl('^A10A*?', as.character(dt2$hai_atc))& between(date_of_claim,as.Date("2012-01-01"),as.Date("2012-12-31")),
    index := as.character(min(date_of_claim))
    ,   by=c('hai_dispense_number','hai_atc')]  #
dt2$index<-as.Date(dt2$index, origin='1970-01-01')
dt2$hai_atc<-as.character(dt2$hai_atc)

Step 2: Creating the index date creates some NA values for rows that were not A10A or occurred before 2012-01. I need to fill in this NA value with each person's unique index date. This code usually works, but for reason with these data, I keep getting an error

dt2[, index := index[!is.na(index)][[1]], by=hai_dispense_number]  ##gives the index date for each person to each of their individual rows of data

Error in index[!is.na(index)][[1]] : subscript out of bounds

Steps 3, 4 and 5: I usually can follow on from here, and find out who had a code A10A in 2011 and then exclude them. But my process is being interrupted by the above error.

I looked at this page: Subscript out of bounds - general definition and solution? and tried running the tests they recommend: but it doesn't look like i have extra rows for the variable index. Perhaps I'm not testing properly.

If anyone can shed any light on why the above is happening I would be really grateful. Thank you!

Upvotes: 1

Views: 1004

Answers (1)

Jaap
Jaap

Reputation: 83215

You can achieve what you want much easier by using the any function. With any(hai_atc=="A10A") you create a logical vector for all people having A10A and with !any(hai_atc=="A10A" & year(date_of_claim) == 2011) you update that vecor for the ones who have that code in the year 2011 to FALSE. With [idx==TRUE] you filter just the cases you need and with [,idx:=NULL] you remove the idx column:

newDT <- DT[, idx := any(hai_atc=="A10A") & !any(hai_atc=="A10A" & year(date_of_claim) == 2011),
            by = hai_dispense_number
            ][idx==TRUE][,idx:=NULL]

this results in:

> newDT
    hai_dispense_number hai_atc date_of_claim hai_age
 1:                mary    A10A    2012-02-02      36
 2:                mary    A10A    2012-03-02      36
 3:                mary    G123    2012-03-02      36
 4:                mary    E123    2012-05-02      36
 5:                mary    T123    2012-07-02      36
 6:                mary    A10A    2012-08-02      43
 7:                 pat    A10A    2012-01-02      52
 8:                 pat   GR123    2011-11-12      52
 9:                 pat    A10A    2012-03-03      52
10:                 pat   GR123    2012-01-01      52
11:                 pat    A10A    2012-05-06      52

In response to your comment: You might want to use single square brackets (i.e.: [ ]) like dt2[, index := index[!is.na(index)][1], by=hai_dispense_number]. Furthermore, your code can be simplified to:

dt2 <- data.table(mydf,key=c('hai_dispense_number','date_of_claim'))
dt2[, date_of_claim := as.Date(date_of_claim)]
dt2[grepl('^A10A*?', hai_atc) & between(date_of_claim, as.Date("2012-01-01"), as.Date("2012-12-31")),
    index := min(date_of_claim), by=.(hai_dispense_number,hai_atc)]

dt2[, index := index[!is.na(index)][1], by=hai_dispense_number]

However, this gives not the result as you described:

> dt2
    hai_dispense_number hai_atc date_of_claim hai_age      index
 1:                mary    A10A    2012-02-02      36 2012-02-02
 2:                mary    A10A    2012-03-02      36 2012-02-02
 3:                mary    G123    2012-03-02      36 2012-02-02
 4:                mary    E123    2012-05-02      36 2012-02-02
 5:                mary    T123    2012-07-02      36 2012-02-02
 6:                mary    A10A    2012-08-02      43 2012-02-02
 7:                 pat   GR123    2011-11-12      52 2012-01-02
 8:                 pat   GR123    2012-01-01      52 2012-01-02
 9:                 pat    A10A    2012-01-02      52 2012-01-02
10:                 pat    A10A    2012-03-03      52 2012-01-02
11:                 pat    A10A    2012-05-06      52 2012-01-02
12:                 sue   GR123    2012-03-02      43       <NA>
13:                 sue   GR123    2012-03-02      43       <NA>
14:                 sue   GR123    2012-03-08      43       <NA>
15:                 sue   GR123    2012-09-03      43       <NA>
16:                 sue   GR123    2012-10-01      43       <NA>
17:                 tom    A10A    2011-11-01      42 2012-02-02
18:                 tom    A10B    2011-11-01      42 2012-02-02
19:                 tom    G10R    2011-11-01      42 2012-02-02
20:                 tom    A10A    2012-02-02      42 2012-02-02
21:                 tom    A10A    2012-04-03      42 2012-02-02

The reason why you are not getting the desired result with your code is that you are not excluding the groups which have A10A in 2011. The any function is specifically designed to achieve logical operations as described by you.

Furthermore, by using the any function you can achieve the desired result in a simpler way. And that is also applicable for more complex datasets. It's also not a problem to use grepl inside any:

newDT2 <- DT[, idx := any(grepl('^A10A*?', hai_atc)) & !any(grepl('^A10A*?', hai_atc) & year(date_of_claim) == 2011),
             by = hai_dispense_number
             ][idx==TRUE][,idx:=NULL]

which gives the same result:

> identical(newDT, newDT2)
[1] TRUE

Upvotes: 3

Related Questions