Reputation: 1
I have two tibble data frames that I am trying to reconcile. The first tibble has over a million observations; the first few rows are as follows:
data
ID Time(Converted to number)
1 23160
1 23161
1 23162
1 23163
1 23164
1 23165
2 24251
2 24252
The second tibble is a lookup table (holding information about a particular event that has occurred); a simplified version is as follows:
lookup_table
ID Event_Time Event_Indicator Number_of_Cumulative_Events
1 23162 1 1
1 23164 1 2
2 24255 1 1
2 24280 0 1
I would like to add a third column to the first tibble showing the number of cumulative events at the time of each observation. For the example above, the third column would be:
ID Time(Converted to number) Number
1 23160 0
1 23161 0
1 23162 1
1 23163 1
1 23164 2
1 23165 2
2 24251 0
2 24252 0
I am trying to avoid having to loop through the millions of observations to compare each observation's time to the Event_Time in the lookup table because of computation time.
However, I am not sure how to go about doing this without the use of a loop. The issue is that lookup_table contains some IDs multiple times; if every ID appeared in lookup_table only once, then I could do:
data$Event_Time <- lookup_table$Event_Time[match(data$ID, lookup_table$ID)]
data <- data %>% mutate(Number = ifelse(Time >= Event_Time, 1, 0))
Any ideas how I could avoid the use of a loop and yet apply the lookup conditions for each observation? Thank you.
Edit: I am not trying to join the tables, but rather to compare the time columns of lookup_table and data to obtain my desired column. For example, if I were to write an inefficient loop, it would be:
for (i in 1:nrow(data)) {
  sub <- subset(lookup_table, ID == data$ID[i])
  data$Number[i] <- sub[max(which(data$Time[i] >= sub$Event_Time)),
                        "Number_of_Cumulative_Events"]
}
Upvotes: 0
Views: 76
Reputation: 42544
A possible solution is to count the cumulative events after the join. Note that an update on join is used.
library(data.table)
setDT(data)[, new := 0L][setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
, new := cumsum(new), by = ID][]
   ID  Time new
1:  1 23160   0
2:  1 23161   0
3:  1 23162   1
4:  1 23163   1
5:  1 23164   2
6:  1 23165   2
7:  2 24251   0
8:  2 24252   0
Alternatively,
setDT(data)[setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
is.na(new), new := 0][
, new := cumsum(new), by = ID][]
will set missing entries to zero after the join.
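For reference, this variant can be reproduced end-to-end on the sample data from the question (a sketch; column names as posted, with the indicator stored as an integer):

```r
library(data.table)

# Sample data from the question
data <- data.table(
  ID   = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L),
  Time = c(23160L, 23161L, 23162L, 23163L, 23164L, 23165L, 24251L, 24252L)
)
lookup_table <- data.table(
  ID              = c(1L, 1L, 2L, 2L),
  Event_Time      = c(23162L, 23164L, 24255L, 24280L),
  Event_Indicator = c(1L, 1L, 1L, 0L)
)

# Update on join: matching rows receive Event_Indicator, the rest stay NA;
# the NAs are then zeroed and the indicator is cumulated per ID
data[lookup_table, on = .(ID, Time = Event_Time), new := Event_Indicator][
  is.na(new), new := 0L][
  , new := cumsum(new), by = ID][]
```

Only the rows of data whose Time exactly matches an Event_Time of the same ID are updated; the cumulative sum then carries the count forward between events.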
A completely different approach is to use a rolling join:
lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
   ID Event_Time Number_of_Cumulative_Events
1:  1      23160                          NA
2:  1      23161                          NA
3:  1      23162                           1
4:  1      23163                           1
5:  1      23164                           2
6:  1      23165                           2
7:  2      24251                          NA
8:  2      24252                          NA
(NAs have been left untouched for illustration.)
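If pulling in data.table is not desired, a base R sketch of the same per-ID lookup uses findInterval() to locate the last event at or before each observation. It assumes every ID in data also occurs in lookup_table, as in the example:

```r
# Sample data from the question, as plain data frames
data <- data.frame(ID   = c(1, 1, 1, 1, 1, 1, 2, 2),
                   Time = c(23160:23165, 24251, 24252))
lookup_table <- data.frame(ID              = c(1, 1, 2, 2),
                           Event_Time      = c(23162, 23164, 24255, 24280),
                           Event_Indicator = c(1, 1, 1, 0))

# For each ID, cumulate the indicator over sorted event times, then use
# findInterval() to find the last event at or before each observation
data$Number <- unsplit(
  Map(function(d, l) {
    o   <- order(l$Event_Time)
    cum <- cumsum(l$Event_Indicator[o])
    idx <- findInterval(d$Time, l$Event_Time[o])
    c(0, cum)[idx + 1]   # idx == 0 means no event has happened yet
  }, split(data, data$ID), split(lookup_table, lookup_table$ID)),
  data$ID
)
```

This avoids a row-by-row loop: findInterval() is vectorised over all observations of an ID at once, so the work per ID is a sort plus one binary-search pass.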
Upvotes: 2