Reputation: 757
I have a simple data frame as follows:
hai_dispense_number hai_atc date_of_claim hai_age
1 tom A10A 2011-11-01 42
2 tom A10B 2011-11-01 42
3 tom G10R 2011-11-01 42
4 tom A10A 2012-02-02 42
5 tom A10A 2012-04-03 42
6 mary A10A 2012-02-02 36
7 mary A10A 2012-03-02 36
8 mary G123 2012-03-02 36
9 mary E123 2012-05-02 36
10 mary T123 2012-07-02 36
11 mary A10A 2012-08-02 43
12 sue GR123 2012-03-02 43
13 sue GR123 2012-03-08 43
14 sue GR123 2012-09-03 43
15 sue GR123 2012-10-01 43
16 sue GR123 2012-03-02 43
17 pat A10A 2012-01-02 52
18 pat GR123 2011-11-12 52
19 pat A10A 2012-03-03 52
20 pat GR123 2012-01-01 52
21 pat A10A 2012-05-06 52
I want to isolate people who had the code A10A only from 2012-01-01 onwards, and not in 2011. There are some people in the above dataframe who did not have the code A10A at any stage and so I'd like to get rid of them too. So in the end, I want to have data only for pat and mary. Here is how I do it:
Step 1: Assigns the "index date". In other words the first date in 2012 that someone got the code A10A
dt2<- data.table(dff,key=c('hai_dispense_number','date_of_claim'))
dt2[,date_of_claim := as.Date(date_of_claim)]
dt2[grepl('^A10A*?', as.character(dt2$hai_atc))& between(date_of_claim,as.Date("2012-01-01"),as.Date("2012-12-31")),
index := as.character(min(date_of_claim))
, by=c('hai_dispense_number','hai_atc')] #
dt2$index<-as.Date(dt2$index, origin='1970-01-01')
dt2$hai_atc<-as.character(dt2$hai_atc)
Step 2: Creating the index date creates some NA values for rows that were not A10A or occurred before 2012-01. I need to fill in this NA value with each person's unique index date. This code usually works, but for reason with these data, I keep getting an error
dt2[, index := index[!is.na(index)][[1]], by=hai_dispense_number] ##gives the index date for each person to each of their individual rows of data
Error in index[!is.na(index)][[1]] : subscript out of bounds
Steps 3, 4 and 5: I usually can follow on from here, and find out who had a code A10A in 2011 and then exclude them. But my process is being interrupted by the above error.
I looked at this page: Subscript out of bounds - general definition and solution? and tried running the tests they recommend: but it doesn't look like i have extra rows for the variable index. Perhaps I'm not testing properly.
If anyone can shed any light on why the above is happening I would be really grateful. Thank you!
Upvotes: 1
Views: 1004
Reputation: 83215
You can achieve what you want much easier by using the any
function. With any(hai_atc=="A10A")
you create a logical vector for all people having A10A
and with !any(hai_atc=="A10A" & year(date_of_claim) == 2011)
you update that vecor for the ones who have that code in the year 2011 to FALSE
. With [idx==TRUE]
you filter just the cases you need and with [,idx:=NULL]
you remove the idx
column:
newDT <- DT[, idx := any(hai_atc=="A10A") & !any(hai_atc=="A10A" & year(date_of_claim) == 2011),
by = hai_dispense_number
][idx==TRUE][,idx:=NULL]
this results in:
> newDT
hai_dispense_number hai_atc date_of_claim hai_age
1: mary A10A 2012-02-02 36
2: mary A10A 2012-03-02 36
3: mary G123 2012-03-02 36
4: mary E123 2012-05-02 36
5: mary T123 2012-07-02 36
6: mary A10A 2012-08-02 43
7: pat A10A 2012-01-02 52
8: pat GR123 2011-11-12 52
9: pat A10A 2012-03-03 52
10: pat GR123 2012-01-01 52
11: pat A10A 2012-05-06 52
In response to your comment: You might want to use single square brackets (i.e.: [ ]
) like dt2[, index := index[!is.na(index)][1], by=hai_dispense_number]
. Furthermore, your code can be simplified to:
dt2 <- data.table(mydf,key=c('hai_dispense_number','date_of_claim'))
dt2[, date_of_claim := as.Date(date_of_claim)]
dt2[grepl('^A10A*?', hai_atc) & between(date_of_claim, as.Date("2012-01-01"), as.Date("2012-12-31")),
index := min(date_of_claim), by=.(hai_dispense_number,hai_atc)]
dt2[, index := index[!is.na(index)][1], by=hai_dispense_number]
However, this gives not the result as you described:
> dt2
hai_dispense_number hai_atc date_of_claim hai_age index
1: mary A10A 2012-02-02 36 2012-02-02
2: mary A10A 2012-03-02 36 2012-02-02
3: mary G123 2012-03-02 36 2012-02-02
4: mary E123 2012-05-02 36 2012-02-02
5: mary T123 2012-07-02 36 2012-02-02
6: mary A10A 2012-08-02 43 2012-02-02
7: pat GR123 2011-11-12 52 2012-01-02
8: pat GR123 2012-01-01 52 2012-01-02
9: pat A10A 2012-01-02 52 2012-01-02
10: pat A10A 2012-03-03 52 2012-01-02
11: pat A10A 2012-05-06 52 2012-01-02
12: sue GR123 2012-03-02 43 <NA>
13: sue GR123 2012-03-02 43 <NA>
14: sue GR123 2012-03-08 43 <NA>
15: sue GR123 2012-09-03 43 <NA>
16: sue GR123 2012-10-01 43 <NA>
17: tom A10A 2011-11-01 42 2012-02-02
18: tom A10B 2011-11-01 42 2012-02-02
19: tom G10R 2011-11-01 42 2012-02-02
20: tom A10A 2012-02-02 42 2012-02-02
21: tom A10A 2012-04-03 42 2012-02-02
The reason why you are not getting the desired result with your code is that you are not excluding the groups which have A10A
in 2011. The any
function is specifically designed to achieve logical operations as described by you.
Furthermore, by using the any
function you can achieve the desired result in a simpler way. And that is also applicable for more complex datasets. It's also not a problem to use grepl
inside any
:
newDT2 <- DT[, idx := any(grepl('^A10A*?', hai_atc)) & !any(grepl('^A10A*?', hai_atc) & year(date_of_claim) == 2011),
by = hai_dispense_number
][idx==TRUE][,idx:=NULL]
which gives the same result:
> identical(newDT, newDT2)
[1] TRUE
Upvotes: 3