leaRningR909
leaRningR909

Reputation: 235

R conditionally matching date-time from one dataframe to closest date-time field in second dataframe

I have two data frames, df.events and df.activ.

df.activ has very granular minute level data and an order of magnitude more records (1,000,000+) than df.events which has ~100,000 records, also at minute level granularity. The two dataframes have two common fields, DateTime and Geo. Both DateTime columns are in as.POSIXlt, %Y-%m-%d %H:%M:%S format.

df.activ <- read.table(text=
                          '"DateTime","Geo","Bin1","Bin2"
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,510,0,1
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:13:00,618,1,1
                        2014-07-01 00:13:00,510,0,1
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0',header=TRUE,sep=",")

df.events <- read.table(text=
                          '"Units","Geo","DateTime"
                        225,999,2014-07-01 00:09:00
                        40,510,2014-07-01 00:12:00
                        5,999,2014-07-01 00:28:00
                        115,999,2014-07-01 00:44:00
                        0,999,2014-07-01 00:47:00',header=TRUE,sep=",")

My goal is to merge df.activ to the nearest DateTime in df.events if the Geo field value in same row (in df.events) is 999.

If df.event's Geo is not 999 then I only want to merge on df.event if the Geo fields match (e.g. the cases where Geo = 510 in the provided data frames).

I know for-loops aren't the right way to solve things in R, but conceptually I'm looking to do a nested for-loop, by looping down df.activ's DateTime field and bringing on the record with the closest DateTime from df.events if the Geo field is 999 or matches the Geo field in df.activ.

The data frame below is what I'm after:

df.idealresults <- read.table(text=
                              'DateTime,Geo,Bin1,Bin2,events.DateTime,events.Units,Events.Geo
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,510,0,1,7/1/2014 0:12,40,510
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,618,1,1,7/1/2014 0:09,225,999
                              7/1/2014 0:13,510,0,1,7/1/2014 0:12,40,510
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999',header=TRUE,sep=',')

Thus far, I've been able to merge df.activ to the nearest DateTime in df.events. I accomplished this using a na.locf based approach inspired by the latter part of the answer to this SO post. I've struggled trying to incorporate the Geo matching logic into this approach; the nature of na.locf makes it difficult to make this work right since it relies on vector to vector NAs which are bound on before the merge step.

Upvotes: 4

Views: 1983

Answers (2)

Mysterio
Mysterio

Reputation: 11

I am at work and this seems relatively solved so I will be brief. You can also do a full outer merge, then simply take the differences in the dates. Use distinct sorted by the absolute value of the difference in the dates.

This is probably the algorithmically fastest way to do your merge but will require more RAM than for looping (your full merge will have n1*n2 observations).

Upvotes: 0

Dominic Comtois
Dominic Comtois

Reputation: 10431

It is sometimes hard to avoid loops, especially when you have conditions like you do. Sometimes we end up spending much efforts avoiding them while they are probably either the best we can do, or are not too far behind in terms of performance and/or readability. Having said that, this would do the trick:

df.activ$DateTime <- as.POSIXct(df.activ$DateTime)
df.events$DateTime <- as.POSIXct(df.events$DateTime)

results <- df.activ
results$events.Units=NA
results$events.Geo=NA
results$events.Datetime=NA

for(i in seq_len(nrow(df.activ))) {
  diffs <- order(abs(df.activ$DateTime[i] - df.events$DateTime))
  for(j in seq_along(diffs)) {
    if(df.events$Geo[diffs[j]] == 999) {
      results[i, 5:7] <- df.events[diffs[j],]
      break
    } else if(isTRUE(df.events$Geo[diffs[j]] == df.activ$Geo[i])) {
      results[i, 5:7] <- df.events[diffs[j],]
      break
    }
  }
}

results$events.DateTime <- as.POSIXct(results$events.Datetime,origin = "1970-01-01")

results
              DateTime Geo Bin1 Bin2 events.Units events.Geo events.Datetime     events.DateTime
1  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
2  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
3  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
4  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
5  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
6  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
7  2014-07-01 00:12:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
8  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
9  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
10 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
11 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
12 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
13 2014-07-01 00:13:00 618    1    1          225        999      1404187740 2014-07-01 00:09:00
14 2014-07-01 00:13:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
15 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
16 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
17 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
18 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
19 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
20 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
21 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00

Upvotes: 3

Related Questions