Reputation: 15
Time(Data Frame):
CentralTime Batch Id
2020-04-01 03:46:01 UTC
2020-04-01 10:46:01 UTC
2020-04-01 10:54:18 UTC
2020-04-01 10:54:25 UTC
2020-04-01 10:54:31 UTC
2020-04-01 10:55:06 UTC
2020-04-01 10:55:12 UTC
2020-04-01 10:55:26 UTC
2020-04-01 10:55:32 UTC
2020-04-01 10:55:39 UTC
2020-04-01 10:55:45 UTC
2020-04-01 10:56:20 UTC
2020-04-01 10:56:26 UTC
2020-04-01 10:56:33 UTC
2020-04-01 10:56:39 UTC
2020-04-01 10:56:53 UTC
2020-04-01 10:56:59 UTC
2020-04-01 10:57:06 UTC
2020-04-01 10:57:14 UTC
2020-04-01 10:57:20 UTC
2020-04-01 11:37:20 UTC
2020-04-01 11:38:27 UTC
2020-04-01 11:38:33 UTC
2020-04-01 11:38:47 UTC
2020-04-01 11:38:53 UTC
2020-04-01 11:39:15 UTC
2020-04-01 11:39:27 UTC
2020-04-01 11:39:41 UTC
2020-04-01 11:39:47 UTC
2020-04-01 11:39:54 UTC
2020-04-01 11:40:00 UTC
2020-04-01 11:40:28 UTC
2020-04-01 17:30:28 UTC
2020-04-01 17:36:18 UTC
2020-04-02 00:26:18 UTC
2020-04-02 00:28:46 UTC
2020-04-02 00:29:20 UTC
2020-04-02 00:29:28 UTC
2020-04-02 00:29:34 UTC
2020-04-02 00:29:41 UTC
2020-04-02 00:29:47 UTC
2020-04-02 00:30:01 UTC
2020-04-02 00:30:07 UTC
2020-04-02 00:30:21 UTC
2020-04-02 00:30:27 UTC
2020-04-02 00:30:35 UTC
2020-04-02 00:30:42 UTC
2020-04-02 00:30:48 UTC
2020-04-02 00:30:55 UTC
2020-04-02 00:31:01 UTC
2020-04-02 00:31:15 UTC
BatchId(Data Frame):
Batch Id dateTime nextDate
ABC053272A 2020-04-01 00:00:48 UTC 2020-04-02 00:29:47 UTC
ABC053314A 2020-04-02 00:29:47 UTC 2020-04-03 00:12:58 UTC
ABC053330A 2020-04-03 00:12:58 UTC 2020-04-04 01:16:54 UTC
ABC053355A 2020-04-04 01:16:54 UTC 2020-04-07 00:33:57 UTC
ABC053405A 2020-04-07 00:33:57 UTC 2020-04-08 00:46:47 UTC
ABC053421A 2020-04-08 00:46:47 UTC 2020-04-09 00:36:56 UTC
ABC053447A 2020-04-09 00:36:56 UTC 2020-04-10 01:26:55 UTC
ABC053462A 2020-04-10 01:26:55 UTC 2020-04-13 08:13:50 UTC
ABC053470 2020-04-13 08:13:50 UTC 2020-04-14 10:07:56 UTC
ABC053496A 2020-04-14 10:07:56 UTC 2020-04-15 11:08:59 UTC
ABC053520A 2020-04-15 11:08:59 UTC 2020-04-16 17:51:28 UTC
ABC053553A 2020-04-16 17:51:28 UTC 2020-04-20 04:24:53 UTC
ABC053611A 2020-04-20 04:24:53 UTC 2020-04-22 00:09:56 UTC
ABC053652A 2020-04-22 00:09:56 UTC 2020-04-22 12:05:49 UTC
ABC053652B 2020-04-22 12:05:49 UTC 2020-04-23 14:12:53 UTC
ABC053686 2020-04-23 14:12:53 UTC 2020-04-24 12:14:55 UTC
ABC053694A 2020-04-24 12:14:55 UTC 2020-04-28 00:08:59 UTC
ABC053710A 2020-04-28 00:08:59 UTC 2020-04-29 00:34:56 UTC
ABC053769A 2020-04-29 00:34:56 UTC 2020-04-30 00:59:58 UTC
ABC053793A 2020-04-30 00:59:58 UTC 2020-05-01 00:41:54 UTC
ABC053827A 2020-05-01 00:41:54 UTC 2020-05-05 00:53:55 UTC
ABC053876A 2020-05-05 00:53:55 UTC 2020-05-06 04:10:55 UTC
ABC053892A 2020-05-06 04:10:55 UTC 2020-05-07 06:22:56 UTC
ABC053918A 2020-05-07 06:22:56 UTC 2020-05-08 06:02:55 UTC
ABC053942A 2020-05-08 06:02:55 UTC 2020-05-11 06:43:42 UTC
ABC053967A 2020-05-11 06:43:42 UTC 2020-05-12 07:01:57 UTC
ABC053991A 2020-05-12 07:01:57 UTC 2020-05-13 05:08:47 UTC
ABC054007A 2020-05-13 05:08:47 UTC 2020-05-14 03:36:55 UTC
ABC054023A 2020-05-14 03:36:55 UTC 2020-05-15 02:32:58 UTC
ABC054064A 2020-05-15 02:32:58 UTC 2020-05-18 04:32:57 UTC
I am trying to get values from batch id column(BatchId data frame) to batch id column(Time data frame) based on if CentralTime(Time data frame) lies in between dateTime(BatchId data frame) and nextDate(BatchId data frame)
I am using "for" loop for getting this values but its taking too much time. trying to find alternative solution. I just posted subset of data from what I have. Below is the code.
if(nrow(BatchId)!=0){
for(i in 1:nrow(Time)){
for(j in 1:nrow(BatchId)){
if (Time[i,"CentralTime"] < BatchId[j,"nextDate"] &
Time[i,"CentralTime"]> BatchId[j,"dateTime"]) {
Time[i,"batchId"]<-BatchId[j,"Batch Id"]
}
}
}
}
Upvotes: 0
Views: 63
Reputation: 160447
Rarely is a double-for
loop a necessary (or even advisable) way to approach problems in R, and this is no exception. In fact, this begs a "non-equality" join. Base R doesn't support it, and while dplyr
supports it when connected with dbplyr::sql_on
, I suggest data.table
's method:
I'll create my own Time
so that see some matches:
library(data.table)
Time <- data.frame(CentralTime = BatchId$dateTime[3] + c(0, 1000, 3000, 9000))
Time
# CentralTime
# 1: 2020-04-03 00:12:58
# 2: 2020-04-03 00:29:38
# 3: 2020-04-03 01:02:58
# 4: 2020-04-03 02:42:58
I'm assuming that neither frame is of data.table
class, so I'll be a little careful. (If you're already using data.table
, then you likely know what can be removed from this code. If not, then (1) data.table
operates in-place unlike R's default copy-on-write semantics; (2) doing this requires another attribute (memory address) that must be set before data.table
operators will work on it; and setDT
and setDF
change to and from that format, respectively. I recommend against keeping it as a data.table
-class frame if you aren't doing data.table
stuff on and to it, since there are some base R frame behaviors that do change.)
library(data.table)
setDT(BatchId)
setDT(Time)
out <- BatchId[Time, on = .(dateTime <= CentralTime, nextDate >= CentralTime)]
out <- out[, .(CentralTime = dateTime, BatchId)]
setDF(out)
out
# CentralTime BatchId
# 1 2020-04-03 00:12:58 ABC053314A
# 2 2020-04-03 00:12:58 ABC053330A
# 3 2020-04-03 00:29:38 ABC053330A
# 4 2020-04-03 01:02:58 ABC053330A
# 5 2020-04-03 02:42:58 ABC053330A
Some notes about how data.table
does merges:
DT1[DT2, on = ...]
is a left-join. Disregarding for a moment the non-equi join, this method is similar to
### base R
merge(DT2, DT1, ...)
### dplyr
right_join(DT1, DT2, ...)
left_join(DT2, DT1, ...)
the time field in the "left" frame (Time
, the DT1
in my previous example) is renamed to be the first of the non-equi fields used in the other frame, so if you look at out
immediately after the join, it has columns BatchId
, dateTime
(even though these values are not necessarily equal to any within BatchId$dateTime
... confusing), and nextDate
And not unique to data.table
, this join produces two rows for one of the times, since ids ABC053314A
and ABC053330A
overlap:
subset(BatchId, BatchId %in% c("ABC053314A", "ABC053330A"))
# BatchId dateTime nextDate
# 1: ABC053314A 2020-04-02 00:29:47 2020-04-03 00:12:58
# 2: ABC053330A 2020-04-03 00:12:58 2020-04-04 01:16:54
a <- subset(BatchId, BatchId %in% c("ABC053314A", "ABC053330A"))
a$nextDate[1] == a$dateTime[2]
# [1] TRUE
(which may not always be perfectly equal, since they are effectively floating-point numbers).
If you have a strict inequality on one side, then that reduces this expansion:
setDT(BatchId)
setDT(Time)
out <- BatchId[Time, on = .(dateTime <= CentralTime, nextDate > CentralTime)]
out <- out[, .(CentralTime = dateTime, BatchId)]
setDF(out)
out
# CentralTime BatchId
# 1 2020-04-03 00:12:58 ABC053330A
# 2 2020-04-03 00:29:38 ABC053330A
# 3 2020-04-03 01:02:58 ABC053330A
# 4 2020-04-03 02:42:58 ABC053330A
### cleanup
setDF(BatchId)
setDF(Time)
Data:
BatchId <- read.table(header = TRUE, sep = "|", text = "
BatchId | dateTime | nextDate
ABC053272A | 2020-04-01 00:00:48 UTC | 2020-04-02 00:29:47 UTC
ABC053314A | 2020-04-02 00:29:47 UTC | 2020-04-03 00:12:58 UTC
ABC053330A | 2020-04-03 00:12:58 UTC | 2020-04-04 01:16:54 UTC
ABC053355A | 2020-04-04 01:16:54 UTC | 2020-04-07 00:33:57 UTC
ABC053405A | 2020-04-07 00:33:57 UTC | 2020-04-08 00:46:47 UTC
ABC053421A | 2020-04-08 00:46:47 UTC | 2020-04-09 00:36:56 UTC
ABC053447A | 2020-04-09 00:36:56 UTC | 2020-04-10 01:26:55 UTC
ABC053462A | 2020-04-10 01:26:55 UTC | 2020-04-13 08:13:50 UTC
ABC053470 | 2020-04-13 08:13:50 UTC | 2020-04-14 10:07:56 UTC
ABC053496A | 2020-04-14 10:07:56 UTC | 2020-04-15 11:08:59 UTC
ABC053520A | 2020-04-15 11:08:59 UTC | 2020-04-16 17:51:28 UTC
ABC053553A | 2020-04-16 17:51:28 UTC | 2020-04-20 04:24:53 UTC
ABC053611A | 2020-04-20 04:24:53 UTC | 2020-04-22 00:09:56 UTC
ABC053652A | 2020-04-22 00:09:56 UTC | 2020-04-22 12:05:49 UTC
ABC053652B | 2020-04-22 12:05:49 UTC | 2020-04-23 14:12:53 UTC
ABC053686 | 2020-04-23 14:12:53 UTC | 2020-04-24 12:14:55 UTC
ABC053694A | 2020-04-24 12:14:55 UTC | 2020-04-28 00:08:59 UTC
ABC053710A | 2020-04-28 00:08:59 UTC | 2020-04-29 00:34:56 UTC
ABC053769A | 2020-04-29 00:34:56 UTC | 2020-04-30 00:59:58 UTC
ABC053793A | 2020-04-30 00:59:58 UTC | 2020-05-01 00:41:54 UTC
ABC053827A | 2020-05-01 00:41:54 UTC | 2020-05-05 00:53:55 UTC
ABC053876A | 2020-05-05 00:53:55 UTC | 2020-05-06 04:10:55 UTC
ABC053892A | 2020-05-06 04:10:55 UTC | 2020-05-07 06:22:56 UTC
ABC053918A | 2020-05-07 06:22:56 UTC | 2020-05-08 06:02:55 UTC
ABC053942A | 2020-05-08 06:02:55 UTC | 2020-05-11 06:43:42 UTC
ABC053967A | 2020-05-11 06:43:42 UTC | 2020-05-12 07:01:57 UTC
ABC053991A | 2020-05-12 07:01:57 UTC | 2020-05-13 05:08:47 UTC
ABC054007A | 2020-05-13 05:08:47 UTC | 2020-05-14 03:36:55 UTC
ABC054023A | 2020-05-14 03:36:55 UTC | 2020-05-15 02:32:58 UTC
ABC054064A | 2020-05-15 02:32:58 UTC | 2020-05-18 04:32:57 UTC")
BatchId[c("dateTime", "nextDate")] <-
lapply(BatchId[c("dateTime", "nextDate")], as.POSIXct, tz = "UTC")
Upvotes: 3
Reputation: 8572
Ignoring the error with your problem, you can completely remove one loop. However it is not clear what you'd expect for the case where the statement has multiple matches.
if(nrow(BatchId)!=0){
for(i in 1:nrow(Time)){
idx <- which(Time[i,"CentralTime"] < BatchId[,"nextDate"] &
Time[i,"CentralTime"] > BatchId[,"dateTime"])
if(length(idx) > 1)
stop('more than one match what should I do?')
Time[i, 'batchId'] <- BatchId[idx, "Batch Id"]
}
}
@revans answer however is a better option, both for speed and memory usage.
Upvotes: 0