nick r
nick r

Reputation: 15

alternate to FOR loop in comparing two data frames(takes too much time)

Time(Data Frame):



CentralTime                        Batch Id
2020-04-01 03:46:01 UTC
2020-04-01 10:46:01 UTC
2020-04-01 10:54:18 UTC
2020-04-01 10:54:25 UTC
2020-04-01 10:54:31 UTC
2020-04-01 10:55:06 UTC
2020-04-01 10:55:12 UTC
2020-04-01 10:55:26 UTC
2020-04-01 10:55:32 UTC
2020-04-01 10:55:39 UTC
2020-04-01 10:55:45 UTC
2020-04-01 10:56:20 UTC
2020-04-01 10:56:26 UTC
2020-04-01 10:56:33 UTC
2020-04-01 10:56:39 UTC
2020-04-01 10:56:53 UTC
2020-04-01 10:56:59 UTC
2020-04-01 10:57:06 UTC
2020-04-01 10:57:14 UTC
2020-04-01 10:57:20 UTC
2020-04-01 11:37:20 UTC
2020-04-01 11:38:27 UTC
2020-04-01 11:38:33 UTC
2020-04-01 11:38:47 UTC
2020-04-01 11:38:53 UTC
2020-04-01 11:39:15 UTC
2020-04-01 11:39:27 UTC
2020-04-01 11:39:41 UTC
2020-04-01 11:39:47 UTC
2020-04-01 11:39:54 UTC
2020-04-01 11:40:00 UTC
2020-04-01 11:40:28 UTC
2020-04-01 17:30:28 UTC
2020-04-01 17:36:18 UTC
2020-04-02 00:26:18 UTC
2020-04-02 00:28:46 UTC
2020-04-02 00:29:20 UTC
2020-04-02 00:29:28 UTC
2020-04-02 00:29:34 UTC
2020-04-02 00:29:41 UTC
2020-04-02 00:29:47 UTC
2020-04-02 00:30:01 UTC
2020-04-02 00:30:07 UTC
2020-04-02 00:30:21 UTC
2020-04-02 00:30:27 UTC
2020-04-02 00:30:35 UTC
2020-04-02 00:30:42 UTC
2020-04-02 00:30:48 UTC
2020-04-02 00:30:55 UTC
2020-04-02 00:31:01 UTC
2020-04-02 00:31:15 UTC

BatchId(Data Frame):

Batch Id         dateTime               nextDate
ABC053272A  2020-04-01 00:00:48 UTC 2020-04-02 00:29:47 UTC
ABC053314A  2020-04-02 00:29:47 UTC 2020-04-03 00:12:58 UTC
ABC053330A  2020-04-03 00:12:58 UTC 2020-04-04 01:16:54 UTC
ABC053355A  2020-04-04 01:16:54 UTC 2020-04-07 00:33:57 UTC
ABC053405A  2020-04-07 00:33:57 UTC 2020-04-08 00:46:47 UTC
ABC053421A  2020-04-08 00:46:47 UTC 2020-04-09 00:36:56 UTC
ABC053447A  2020-04-09 00:36:56 UTC 2020-04-10 01:26:55 UTC
ABC053462A  2020-04-10 01:26:55 UTC 2020-04-13 08:13:50 UTC
ABC053470   2020-04-13 08:13:50 UTC 2020-04-14 10:07:56 UTC
ABC053496A  2020-04-14 10:07:56 UTC 2020-04-15 11:08:59 UTC
ABC053520A  2020-04-15 11:08:59 UTC 2020-04-16 17:51:28 UTC
ABC053553A  2020-04-16 17:51:28 UTC 2020-04-20 04:24:53 UTC
ABC053611A  2020-04-20 04:24:53 UTC 2020-04-22 00:09:56 UTC
ABC053652A  2020-04-22 00:09:56 UTC 2020-04-22 12:05:49 UTC
ABC053652B  2020-04-22 12:05:49 UTC 2020-04-23 14:12:53 UTC
ABC053686   2020-04-23 14:12:53 UTC 2020-04-24 12:14:55 UTC
ABC053694A  2020-04-24 12:14:55 UTC 2020-04-28 00:08:59 UTC
ABC053710A  2020-04-28 00:08:59 UTC 2020-04-29 00:34:56 UTC
ABC053769A  2020-04-29 00:34:56 UTC 2020-04-30 00:59:58 UTC
ABC053793A  2020-04-30 00:59:58 UTC 2020-05-01 00:41:54 UTC
ABC053827A  2020-05-01 00:41:54 UTC 2020-05-05 00:53:55 UTC
ABC053876A  2020-05-05 00:53:55 UTC 2020-05-06 04:10:55 UTC
ABC053892A  2020-05-06 04:10:55 UTC 2020-05-07 06:22:56 UTC
ABC053918A  2020-05-07 06:22:56 UTC 2020-05-08 06:02:55 UTC
ABC053942A  2020-05-08 06:02:55 UTC 2020-05-11 06:43:42 UTC
ABC053967A  2020-05-11 06:43:42 UTC 2020-05-12 07:01:57 UTC
ABC053991A  2020-05-12 07:01:57 UTC 2020-05-13 05:08:47 UTC
ABC054007A  2020-05-13 05:08:47 UTC 2020-05-14 03:36:55 UTC
ABC054023A  2020-05-14 03:36:55 UTC 2020-05-15 02:32:58 UTC
ABC054064A  2020-05-15 02:32:58 UTC 2020-05-18 04:32:57 UTC

I am trying to get values from batch id column(BatchId data frame) to batch id column(Time data frame) based on if CentralTime(Time data frame) lies in between dateTime(BatchId data frame) and nextDate(BatchId data frame)

I am using "for" loop for getting this values but its taking too much time. trying to find alternative solution. I just posted subset of data from what I have. Below is the code.

if(nrow(BatchId)!=0){
  for(i in 1:nrow(Time)){
    for(j in 1:nrow(BatchId)){
      if (Time[i,"CentralTime"] < BatchId[j,"nextDate"] & 
            Time[i,"CentralTime"]> BatchId[j,"dateTime"]) {
        Time[i,"batchId"]<-BatchId[j,"Batch Id"]
      }
    }
  }
}

Upvotes: 0

Views: 63

Answers (2)

r2evans
r2evans

Reputation: 160447

Rarely is a double-for loop a necessary (or even advisable) way to approach problems in R, and this is no exception. In fact, this begs a "non-equality" join. Base R doesn't support it, and while dplyr supports it when connected with dbplyr::sql_on, I suggest data.table's method:

I'll create my own Time so that see some matches:

library(data.table)
Time <- data.frame(CentralTime = BatchId$dateTime[3] + c(0, 1000, 3000, 9000))
Time
#            CentralTime
# 1: 2020-04-03 00:12:58
# 2: 2020-04-03 00:29:38
# 3: 2020-04-03 01:02:58
# 4: 2020-04-03 02:42:58

I'm assuming that neither frame is of data.table class, so I'll be a little careful. (If you're already using data.table, then you likely know what can be removed from this code. If not, then (1) data.table operates in-place unlike R's default copy-on-write semantics; (2) doing this requires another attribute (memory address) that must be set before data.table operators will work on it; and setDT and setDF change to and from that format, respectively. I recommend against keeping it as a data.table-class frame if you aren't doing data.table stuff on and to it, since there are some base R frame behaviors that do change.)

library(data.table)
setDT(BatchId)
setDT(Time)
out <- BatchId[Time, on = .(dateTime <= CentralTime, nextDate >= CentralTime)]
out <- out[, .(CentralTime = dateTime, BatchId)]
setDF(out)
out
#           CentralTime    BatchId
# 1 2020-04-03 00:12:58 ABC053314A
# 2 2020-04-03 00:12:58 ABC053330A
# 3 2020-04-03 00:29:38 ABC053330A
# 4 2020-04-03 01:02:58 ABC053330A
# 5 2020-04-03 02:42:58 ABC053330A

Some notes about how data.table does merges:

  • DT1[DT2, on = ...] is a left-join. Disregarding for a moment the non-equi join, this method is similar to

    ### base R
    merge(DT2, DT1, ...)
    
    ### dplyr
    right_join(DT1, DT2, ...)
    left_join(DT2, DT1, ...)
    
  • the time field in the "left" frame (Time, the DT1 in my previous example) is renamed to be the first of the non-equi fields used in the other frame, so if you look at out immediately after the join, it has columns BatchId, dateTime (even though these values are not necessarily equal to any within BatchId$dateTime ... confusing), and nextDate

And not unique to data.table, this join produces two rows for one of the times, since ids ABC053314A and ABC053330A overlap:

subset(BatchId, BatchId %in% c("ABC053314A", "ABC053330A"))
#       BatchId            dateTime            nextDate
# 1: ABC053314A 2020-04-02 00:29:47 2020-04-03 00:12:58
# 2: ABC053330A 2020-04-03 00:12:58 2020-04-04 01:16:54

a <- subset(BatchId, BatchId %in% c("ABC053314A", "ABC053330A"))
a$nextDate[1] == a$dateTime[2]
# [1] TRUE

(which may not always be perfectly equal, since they are effectively floating-point numbers).

If you have a strict inequality on one side, then that reduces this expansion:

setDT(BatchId)
setDT(Time)
out <- BatchId[Time, on = .(dateTime <= CentralTime, nextDate > CentralTime)]
out <- out[, .(CentralTime = dateTime, BatchId)]
setDF(out)
out
#           CentralTime    BatchId
# 1 2020-04-03 00:12:58 ABC053330A
# 2 2020-04-03 00:29:38 ABC053330A
# 3 2020-04-03 01:02:58 ABC053330A
# 4 2020-04-03 02:42:58 ABC053330A

### cleanup
setDF(BatchId)
setDF(Time)

Data:

BatchId <- read.table(header = TRUE, sep = "|", text = "
BatchId    |      dateTime           |     nextDate
ABC053272A | 2020-04-01 00:00:48 UTC | 2020-04-02 00:29:47 UTC
ABC053314A | 2020-04-02 00:29:47 UTC | 2020-04-03 00:12:58 UTC
ABC053330A | 2020-04-03 00:12:58 UTC | 2020-04-04 01:16:54 UTC
ABC053355A | 2020-04-04 01:16:54 UTC | 2020-04-07 00:33:57 UTC
ABC053405A | 2020-04-07 00:33:57 UTC | 2020-04-08 00:46:47 UTC
ABC053421A | 2020-04-08 00:46:47 UTC | 2020-04-09 00:36:56 UTC
ABC053447A | 2020-04-09 00:36:56 UTC | 2020-04-10 01:26:55 UTC
ABC053462A | 2020-04-10 01:26:55 UTC | 2020-04-13 08:13:50 UTC
ABC053470  | 2020-04-13 08:13:50 UTC | 2020-04-14 10:07:56 UTC
ABC053496A | 2020-04-14 10:07:56 UTC | 2020-04-15 11:08:59 UTC
ABC053520A | 2020-04-15 11:08:59 UTC | 2020-04-16 17:51:28 UTC
ABC053553A | 2020-04-16 17:51:28 UTC | 2020-04-20 04:24:53 UTC
ABC053611A | 2020-04-20 04:24:53 UTC | 2020-04-22 00:09:56 UTC
ABC053652A | 2020-04-22 00:09:56 UTC | 2020-04-22 12:05:49 UTC
ABC053652B | 2020-04-22 12:05:49 UTC | 2020-04-23 14:12:53 UTC
ABC053686  | 2020-04-23 14:12:53 UTC | 2020-04-24 12:14:55 UTC
ABC053694A | 2020-04-24 12:14:55 UTC | 2020-04-28 00:08:59 UTC
ABC053710A | 2020-04-28 00:08:59 UTC | 2020-04-29 00:34:56 UTC
ABC053769A | 2020-04-29 00:34:56 UTC | 2020-04-30 00:59:58 UTC
ABC053793A | 2020-04-30 00:59:58 UTC | 2020-05-01 00:41:54 UTC
ABC053827A | 2020-05-01 00:41:54 UTC | 2020-05-05 00:53:55 UTC
ABC053876A | 2020-05-05 00:53:55 UTC | 2020-05-06 04:10:55 UTC
ABC053892A | 2020-05-06 04:10:55 UTC | 2020-05-07 06:22:56 UTC
ABC053918A | 2020-05-07 06:22:56 UTC | 2020-05-08 06:02:55 UTC
ABC053942A | 2020-05-08 06:02:55 UTC | 2020-05-11 06:43:42 UTC
ABC053967A | 2020-05-11 06:43:42 UTC | 2020-05-12 07:01:57 UTC
ABC053991A | 2020-05-12 07:01:57 UTC | 2020-05-13 05:08:47 UTC
ABC054007A | 2020-05-13 05:08:47 UTC | 2020-05-14 03:36:55 UTC
ABC054023A | 2020-05-14 03:36:55 UTC | 2020-05-15 02:32:58 UTC
ABC054064A | 2020-05-15 02:32:58 UTC | 2020-05-18 04:32:57 UTC")
BatchId[c("dateTime", "nextDate")] <-
  lapply(BatchId[c("dateTime", "nextDate")], as.POSIXct, tz = "UTC")

Upvotes: 3

Oliver
Oliver

Reputation: 8572

Ignoring the error with your problem, you can completely remove one loop. However it is not clear what you'd expect for the case where the statement has multiple matches.

if(nrow(BatchId)!=0){
  for(i in 1:nrow(Time)){
    idx <- which(Time[i,"CentralTime"] < BatchId[,"nextDate"] & 
                 Time[i,"CentralTime"] > BatchId[,"dateTime"])
    if(length(idx) > 1)
      stop('more than one match what should I do?')
    Time[i, 'batchId'] <- BatchId[idx, "Batch Id"]
  }
}

@revans answer however is a better option, both for speed and memory usage.

Upvotes: 0

Related Questions