NubCookie
NubCookie

Reputation: 1

Lookup observations data based on another table

I have 2 tibble data frames that I am trying to reconcile. The first tibble has over a million observations, the first few rows are as follows:

    data
    ID  Time(Converted to number)  
     1    23160
     1    23161
     1    23162
     1    23163
     1    23164
     1    23165
     2    24251
     2    24252

The second tibble is a lookup table (that has information of a particular event that has occurred), simplified version as follows:

     lookup_table
     ID  Event_Time  Event_Indicator Number_of_Cumulative_Events
     1    23162           1                    1
     1    23164           1                    2
     2    24255           1                    1
     2    24280           0                    1

I would like to create a 3rd column in the first tibble, such that it shows the number of cumulative events at that time of the observation. The 3rd column in the above example would therefore be:

    ID  Time(Converted to number)  Number 
     1    23160                     0      
     1    23161                     0
     1    23162                     1
     1    23163                     1
     1    23164                     2
     1    23165                     2
     2    24251                     0
     2    24252                     0

I am trying to avoid having to loop through the millions of observations to compare each observation's time to the Event_Time in the lookup table because of computation time.

However, I am not sure how to go about doing this without the use of a loop. The issue is that the lookup_table contains some IDs multiple times, if all IDs only appeared in the lookup_table only once, then I could do:

     data$Event_Time <- lookup_table[match(data$ID, lookup_table$ID),"Event_Time"] 
     data$Number <- data %>% mutate(ifelse(Time >= Event_Time,1,0))

Any ideas how I could avoid the use of a loop and yet apply the lookup conditions for each observation? Thank you.

Edit: I am not trying to join the tables, but more of comparing the time columns in the lookup_table and data table to obtain my desired column. Example, if I were to write an inefficient loop function, it would be:

     for (i in 1:nrow(data)) {
          data$Number[i] <- subset(lookup_table,ID == data$ID[i])[max(which
         (data$Time[i] >= lookup_table$Event_Time)), "Number_of_Cumulative_Events"]
     }

Upvotes: 0

Views: 76

Answers (1)

Uwe
Uwe

Reputation: 42544

A possible solution is to count the cumulative events after the join. Note that an update on join is used.

library(data.table)
setDT(data)[, new := 0L][setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
  , new := cumsum(new), by = ID][]
   ID  Time new
1:  1 23160   0
2:  1 23161   0
3:  1 23162   1
4:  1 23163   1
5:  1 23164   2
6:  1 23165   2
7:  2 24251   0
8:  2 24252   0

Alternatively,

setDT(data)[setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
  is.na(new), new := 0][
    , new := cumsum(new), by = ID][]

will set missing entries to zero after the join.

A completely different approach is to use a rolling join:

lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
   ID Event_Time Number_of_Cumulative_Events
1:  1      23160                          NA
2:  1      23161                          NA
3:  1      23162                           1
4:  1      23163                           1
5:  1      23164                           2
6:  1      23165                           2
7:  2      24251                          NA
8:  2      24252                          NA

(NA's have been left untouched for illustration)

Upvotes: 2

Related Questions