How to combine R dataframes based constraints on a time column

Question

I have two R tables, each with a list of users and a timestamp corresponding to the time that they took a certain action.

The first of these (df1) two tables has an exhaustive list of the users, and users will have multiple rows with different timestamps.

The second (df2) will have a more limited list of users, but again users will be in the table multiple times with different timestamps.

What I'd like to be able to do is join the two tables and end up with a table that matched the user in df1 with the closest timestamp in df2, as long as the timestamp in df2 happened after the one in df1.

For example, if I had two tables like:

df1 <- data.frame(c(1,1,2,3), as.POSIXct(c('2016-12-01 08:53:20', '2016-12-01 12:45:47', '2016-12-01 15:34:54', '2016-12-01 00:49:50')))
names(df1) <- c('user', 'time')

df2 <- data.frame(c(1,1,3), as.POSIXct(c('2016-12-01 07:11:01', '2016-   12-01 11:50:11', '2016-12-01 01:19:10')))
names(df2) <- c('user', 'time')

Giving us:

> df1
  user                time
1    1 2016-12-01 08:53:20
2    1 2016-12-01 12:45:47
3    2 2016-12-01 15:34:54
4    3 2016-12-01 00:49:50

> df2
  user                time
1    1 2016-12-01 07:11:01
2    1 2016-12-01 11:50:11
3    3 2016-12-01 01:19:10

The output I'd hope to get to would look like:

user              time_1                 time_2
 1   2016-12-01 08:53:20    2016-12-01 11:50:11
 1   2016-12-01 12:45:47    NA
 2   2016-12-01 15:34:54    NA
 3   2016-12-01 00:49:50    2016-12-01 01:19:10

I'm pretty stuck on the way to get this join to work out. As an additional layer of complexity I'd love it if there was a parameter that controlled the time window to allow a match (ie only join the row from df2 if it is within X minutes of df1), but really this is secondary to the main problem.

Nick Criswell · Accepted Answer

Part 1 - Original Question

The first part of your question can be answered with the sqldf package.

library(sqldf)
df3 <- sqldf("SELECT * FROM df1 a 
             LEFT JOIN df2 b ON a.time < b.time 
             AND a.user = b.user")[,c(1:2, 4)]

#rename to match OP post
names(df3) <- c("user", "time_1", "time_2")

> df3
  user              time_1              time_2
1    1 2016-12-01 08:53:20 2016-12-01 11:50:11
2    1 2016-12-01 12:45:47                
3    2 2016-12-01 15:34:54                
4    3 2016-12-01 00:49:50 2016-12-01 01:19:10

Part 2 - Time Window

If you want a window of time to allow for the match, you can subtract seconds within the SQL statement as follows:

df3 <- sqldf("SELECT * FROM df1 a 
             LEFT JOIN df2 b ON a.time < (b.time - 10000)
             AND a.user = b.user")[,c(1:2, 4)]
> df3
  user                time              time.1
1    1 2016-12-01 08:53:20 2016-12-01 11:50:11
2    1 2016-12-01 12:45:47                
3    2 2016-12-01 15:34:54                
4    3 2016-12-01 00:49:50

Note, whatever you select from b.time will be in seconds.

How to combine R dataframes based constraints on a time column

Answers (2)

Part 1 - Original Question

Part 2 - Time Window

Related Questions