haimen
haimen

Reputation: 2015

find the shortest time difference between two dataframes

Suppose I have two dataframes,

df1

id        time1
1         2016-04-07 21:39:10
1         2016-04-05 11:19:17
2         2016-04-03 10:58:25
2         2016-04-02 21:39:10

df2

id        time2
1         2016-04-07 21:39:11
1         2016-04-05 11:19:18
1         2016-04-06 21:39:11
1         2016-04-04 11:19:18
2         2016-04-03 10:58:26
2         2016-04-02 21:39:11
2         2016-04-04 10:58:26
2         2016-04-05 21:39:11

I want to find for each entry in df1, the shortest time difference in df2. Suppose we take the first entry, it has id 1, so I want to loop through df2, filter for id 1, then check the time difference between one entry of df1 and remaining entries of df2 and find the shortest difference and fetch the corresponding entry. My sample output should be

id        time                   time2                    diff(in secs)
1         2016-04-07 21:39:10    2016-04-07 21:39:10        1
1         2016-04-05 11:19:17    2016-04-05 11:19:17        1
2         2016-04-03 10:58:25    2016-04-03 10:58:25        1
2         2016-04-02 21:39:10    2016-04-02 21:39:10        1

the following is my try,

for(i in unique(df1$id)){
  temp1 = df1[df1$id == i,]
  temp2 = df2[df2$id == i,]
  for(j in unique(df1$time1){
     for(k in unique(df2$time2){
        diff = abs(df1$time1[j] - df2$time2[k]
        print(diff)}}}

I am not able to progress after this, getting many errors. Can anybody help me in correcting this? May be suggest a better efficient way to do this? Any help would be appreciated.

Update:

Reproducable data:

    df1 <- data.frame(
        id = c(1,1,2,2),
        time1 = c('2016-04-07 21:39:10', '2016-04-05 11:19:17', '2016-04-03 10:58:25', '2016-04-02 21:39:10')
    )

    df2 <- data.frame(
        id = c(1,1,1,1,2,2,2,2),
        time2 = c('2016-04-07 21:39:11', '2016-04-05 11:19:18','2016-04-07 21:39:11', '2016-04-05 11:19:18', '2016-04-03 10:58:26', '2016-04-02 21:39:11','2016-04-03 10:58:26', '2016-04-02 21:39:11')
    )

df1$time1 =  as.POSIXct(df1$time1)
df2$time2 = as.POSIXct(df2$time2)

Upvotes: 1

Views: 199

Answers (3)

Bulat
Bulat

Reputation: 6969

If you work with data.table:

library(data.table)
df1 <- data.table(
  id = c(1,1,2,2),
  time1 = c('2016-04-07 21:39:10', '2016-04-05 11:19:17', '2016-04-03 10:58:25', '2016-04-02 21:39:10')
)

df2 <- data.table(
  id = c(1,1,1,1,2,2,2,2),
  time2 = c('2016-04-07 21:39:11', '2016-04-05 11:19:18','2016-04-07 21:39:11', '2016-04-05 11:19:18', '2016-04-03 10:58:26', '2016-04-02 21:39:11','2016-04-03 10:58:26', '2016-04-02 21:39:11')
)

df1$time1 =  as.POSIXct(df1$time1)
df2$time2 = as.POSIXct(df2$time2)

res <- df1[df2, .(time1, time2), by  = .EACHI, on = "id"][, diff:= abs(time2 -time1)]
setkey(res, id, time1, diff)
res <- res[, row := seq_along(.I), by = .(id, time1)][row == 1]

Upvotes: 0

ajrwhite
ajrwhite

Reputation: 8458

You can also do this in base R. To generate random dates (useful), I borrowed and edited a nice function from elsewhere on StackOverflow:

latemail <- function(N, st="2011/01/01", et="2016/12/31") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et,st,unit="sec"))
  ev <- sort(runif(N, 0, dt))
  return(st + ev)
}
df1 <- data.frame(id=c(1,1,2,2), time1=latemail(4))
df2 <- data.frame(id=c(rep(1,4), rep(2,4)), time2=latemail(8))

And then your answer can be achieved in two lines:

shortest <- sapply(df1$time1, function(x) which(abs(df2$time2 - x) == min(abs(df2$time2 - x))))
cbind(df1, df2[shortest,])

Output:

id               time1 id               time2
 1 2011-10-08 02:00:21  1 2011-08-17 18:07:47
 1 2012-05-06 17:49:03  1 2012-09-04 19:52:40
 2 2013-10-29 13:14:51  1 2012-10-29 20:09:31
 2 2016-06-17 19:23:43  2 2015-11-24 02:07:15

Upvotes: 1

chappers
chappers

Reputation: 2415

You can achieve this using dplyr. Basically the idea is since we want to produce an entry we'll assign each element in df1 a new id (in this case I just called it rowname).

After this, all we're interested in is joining the two dataframes on the id and filter them based on the minimum absolute difference.

library(dplyr)

df1$time1 <- as.POSIXct(as.character(df1$time1))
df2$time2 <- as.POSIXct(as.character(df2$time2))

df1 %>% 
  add_rownames("rowname") %>%
  left_join(df2, "id") %>% 
  mutate(diff=time2-time1) %>%
  group_by(rowname) %>%
  filter(min(abs(diff)) == abs(diff)) %>% 
  distinct

This is my output:

Source: local data frame [4 x 5]
Groups: rowname [4]

  rowname    id               time1               time2   diff
    (chr) (dbl)              (time)              (time) (dfft)
1       1     1 2016-04-07 21:39:10 2016-04-07 21:39:11 1 secs
2       2     1 2016-04-05 11:19:17 2016-04-05 11:19:18 1 secs
3       3     2 2016-04-03 10:58:25 2016-04-03 10:58:26 1 secs
4       4     2 2016-04-02 21:39:10 2016-04-02 21:39:11 1 secs      

Upvotes: 2

Related Questions