siddhesh tiwari
siddhesh tiwari

Reputation: 185

How to build efficient loops for lookup in R

I have a Data set consisting of dates when a person left the network. A person can leave a network multiple times as they may join the network again after leaving it. Following code replicates the scenario.

library(data.table)
Leaving_Date<- data.table(Id= c(1,2,3,4,3,5),Date = as.Date(
c("2017-01-01","2017-02-03","2017-01-01","2017-03-10","2017-02-09","2017-02-05")))

(ids repeat multiple times in this table as a person can leave a network multiple times given they joined it again)

 > Leaving_Date
   Id       Date
1:  1 2017-01-01
2:  2 2017-02-03
3:  3 2017-01-01
4:  4 2017-03-10
5:  3 2017-02-09
6:  5 2017-02-05

I have another dataset giving the dates whenever a particular person was followed up which can be before or after they left the network. Following code replicates the scenario.

FOLLOWUPs <- data.table(Id = c(1,2,3,2,2,3,3,4,1,5),
                        Date =as.Date(c("2016-10-01","2017-02-04",
                        "2017-01-17","2017-02-23", "2017-03-03",
                        "2017-02-10","2017-02-11","2017-01-01",
                        "2017-01-15","2017-01-01")))


> FOLLOWUPs
    Id       Date
 1:  1 2016-10-01
 2:  2 2017-02-04
 3:  3 2017-01-17
 4:  2 2017-02-23
 5:  2 2017-03-03
 6:  3 2017-02-10
 7:  3 2017-02-11
 8:  4 2017-01-01
 9:  1 2017-01-15
10:  5 2017-01-01

Now I want to lookup each case in Leaving_Date and find dates when they were followed up and create three columns(SevenDay, FourteenDay,ThirtyDay) indicating time period of followup(incase if there was any) in 0s and 1s. I am using following code :

SEVENDAY_FOLLOWUP <- vector()
FOURTEEN_FOLLOWUP <- vector()
THIRTYDAY_FOLLOWUP <- vector()
for(i in 1:nrow(Leaving_Date)){
  sub_data <- FOLLOWUPs[Id== Leaving_Date[i,1]]
  if(nrow(sub_data[Date > Leaving_Date[i,Date] &
                   Date < (Leaving_Date[i,Date]+7)])== 0){
     SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,0)
   }
   else{
     SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,1)
   }

   if(nrow(sub_data[Date > Leaving_Date[i,Date] &
                    Date < (Leaving_Date[i,Date]+14)])== 0){
     FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,0)
   }
   else{
     FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,1)
   }

   if(nrow(sub_data[Date > Leaving_Date[i,Date] &
                    Date < (Leaving_Date[i,Date]+30)])== 0){
     THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,0)
   }
   else{
     THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,1)
   }
 }               


 Leaving_Date$SEVENDAY <- as.vector(SEVENDAY_FOLLOWUP)
 Leaving_Date$FOURTEENDAY <- as.vector(FOURTEEN_FOLLOWUP)
 Leaving_Date$THIRTYDAY <- as.vector(THIRTYDAY_FOLLOWUP)

Final Data

 > Leaving_Date
    Id       Date SEVENDAY FOURTEENDAY THIRTYDAY
 1:  1 2017-01-01        0           0         1
 2:  2 2017-02-03        1           1         1
 3:  3 2017-01-01        0           0         1
 4:  4 2017-03-10        0           0         0
 5:  3 2017-02-09        1           1         1
 6:  5 2017-02-05        0           0         0

This code is very inefficient as I have to run it for 100k observations and it takes a lot of time. Is there any efficient way to do this.

Upvotes: 1

Views: 148

Answers (3)

Frank
Frank

Reputation: 66819

Using a non-equi join:

setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n := 
  FOLLOWUPs[.SD, on=.(Id, Date > Date), mult = "first", x.Date - i.Date]
]

   Id       Date       n
1:  1 2017-01-01 14 days
2:  2 2017-02-03  1 days
3:  3 2017-01-01 16 days
4:  4 2017-03-10 NA days
5:  3 2017-02-09  1 days
6:  5 2017-02-05 NA days

Switching from Date to IDate will probably make this about twice as fast. See ?IDate.


I think it's best to stop here, but n can be compared against 7, 14, 30 if necessary, like

Leaving_Date[, bin := c(7, 14, 30)[ findInterval(n, c(0, 7, 14, 30)) ]]

   Id       Date       n bin
1:  1 2017-01-01 14 days  30
2:  2 2017-02-03  1 days   7
3:  3 2017-01-01 16 days  30
4:  4 2017-03-10 NA days  NA
5:  3 2017-02-09  1 days   7
6:  5 2017-02-05 NA days  NA

Side note: Please don't give tables names like this.

Upvotes: 4

Akhil Nair
Akhil Nair

Reputation: 3274

We can do this as a query instead of a loop. First, I cleaned your data.tables a bit because I was getting confused by the variable names.

To make things easier for the comparison step, we first pre-compute the follow up date limit for the 7, 14 and 30 day thresholds.

library(dplyr)

dt_leaving_neat = Leaving_Date %>%
  mutate(.id = 1:n()) %>%
  mutate(limit_07 = Date + 7) %>%
  mutate(limit_14 = Date + 14) %>%
  mutate(limit_30 = Date + 30) %>%
  rename(id = .id, id_person = Id, leaving_date = Date)

dt_follow_neat = FOLLOWUPs %>% 
  select(id_person = Id, followed_up_date = Date)

The actual operation is just a query. It's written out in dplyr for readability, but if speed is a main concern of yours, you could translate it to data.table. I'd recommend running each step in the pipeline to make sure you understand what's going on.

dt_followed_up = dt_leaving_neat %>%
  tidyr::gather(follow_up, limit_date, limit_07:limit_30) %>%
  left_join(dt_follow_neat, by = "id_person") %>%
  mutate(followed_up = (followed_up_date > leaving_date) & (followed_up_date < limit_date)) %>%
  select(id, id_person, leaving_date, follow_up, followed_up) %>%
  filter(followed_up == TRUE) %>%
  unique() %>%
  tidyr::spread(follow_up, followed_up, fill = 0) %>%
  select(id, id_person, leaving_date, limit_07, limit_14, limit_30)

The idea is to join the leaving dates to the follow up dates and check whether the follow up date is within the threshold (and also after the leaving date, as presumably you can't follow up before leaving).

Then some final cleaning to return your desired format. You can use select or rename to change the column names back too.

dt_result = dt_leaving_neat %>%
  select(id, id_person, leaving_date) %>%
  left_join(dt_followed_up, by = c("id", "id_person", "leaving_date"))

dt_result[is.na(dt_result)] = 0

Result

> dt_result
  id id_person leaving_date limit_07 limit_14 limit_30
1  1         1   2017-01-01        0        0        1
2  2         2   2017-02-03        1        1        1
3  3         3   2017-01-01        0        0        1
4  4         4   2017-03-10        0        0        0
5  5         3   2017-02-09        1        1        1
6  6         5   2017-02-05        0        0        0

And following Andrew's answer, an equivalent 1 line data.table soln is

FOLLOWUPs[Leaving_Date, on = "Id", .(Id, follow_date = Date, leaving_date = i.Date)][, diff := follow_date - leaving_date][, .(seven = any(diff %in% 0:6), fourteen = any(diff %in% 0:13), thirty = any(diff %in% 0:29)), .(Id, leaving_date)]

Upvotes: 0

Andrew Gustar
Andrew Gustar

Reputation: 18425

I think this does what you are looking for using dplyr.

It does an 'inner join' by Id - generating all combinations of dates in the two data frames for a given Id - then calculates the date differences, groups by Id, then checks whether there are values falling in the ranges for your three categories.

library(dplyr)

Leaving_Date2 <- Leaving_Date %>% inner_join(FOLLOWUPs %>% rename(FU_Date=Date)) %>% 
  mutate(datediff=as.numeric(FU_Date-Date)) %>% group_by(Id,Date) %>% 
  summarise(SEVENDAY=as.numeric(any(datediff %in% 0:6)),
            FOURTEENDAY=as.numeric(any(datediff %in% 0:13)),
            THIRTYDAY=as.numeric(any(datediff %in% 0:29)))

Upvotes: 0

Related Questions