anthr
anthr

Reputation: 1036

How to combine R dataframes based constraints on a time column

I have two R tables, each with a list of users and a timestamp corresponding to the time that they took a certain action.

The first of these (df1) two tables has an exhaustive list of the users, and users will have multiple rows with different timestamps.

The second (df2) will have a more limited list of users, but again users will be in the table multiple times with different timestamps.

What I'd like to be able to do is join the two tables and end up with a table that matched the user in df1 with the closest timestamp in df2, as long as the timestamp in df2 happened after the one in df1.

For example, if I had two tables like:

df1 <- data.frame(c(1,1,2,3), as.POSIXct(c('2016-12-01 08:53:20', '2016-12-01 12:45:47', '2016-12-01 15:34:54', '2016-12-01 00:49:50')))
names(df1) <- c('user', 'time')

df2 <- data.frame(c(1,1,3), as.POSIXct(c('2016-12-01 07:11:01', '2016-   12-01 11:50:11', '2016-12-01 01:19:10')))
names(df2) <- c('user', 'time')

Giving us:

> df1
  user                time
1    1 2016-12-01 08:53:20
2    1 2016-12-01 12:45:47
3    2 2016-12-01 15:34:54
4    3 2016-12-01 00:49:50

> df2
  user                time
1    1 2016-12-01 07:11:01
2    1 2016-12-01 11:50:11
3    3 2016-12-01 01:19:10

The output I'd hope to get to would look like:

user              time_1                 time_2
 1   2016-12-01 08:53:20    2016-12-01 11:50:11
 1   2016-12-01 12:45:47    NA
 2   2016-12-01 15:34:54    NA
 3   2016-12-01 00:49:50    2016-12-01 01:19:10

I'm pretty stuck on the way to get this join to work out. As an additional layer of complexity I'd love it if there was a parameter that controlled the time window to allow a match (ie only join the row from df2 if it is within X minutes of df1), but really this is secondary to the main problem.

Upvotes: 2

Views: 299

Answers (2)

lmo
lmo

Reputation: 38510

Here is a data.table solution.

# load data.table and make cast data.frames as data.tables
library(data.table)
setDT(df1)
setDT(df2)

# add time variables, perform join and removing merging time variable
dfDone <- df2[, time2 := time][df1[, time1 := time],
              on=.(user, time > time)][, time:= NULL]

dfDone
   user               time2               time1
1:    1 2016-12-01 11:50:11 2016-12-01 08:53:20
2:    1                <NA> 2016-12-01 12:45:47
3:    2                <NA> 2016-12-01 15:34:54
4:    3 2016-12-01 01:19:10 2016-12-01 00:49:50

If you want to order the columns, you could use setcolorder

setcolorder(dfDone, c("user", "time1", "time2"))

dfDone
   user               time1               time2
1:    1 2016-12-01 08:53:20 2016-12-01 11:50:11
2:    1 2016-12-01 12:45:47                <NA>
3:    2 2016-12-01 15:34:54                <NA>
4:    3 2016-12-01 00:49:50 2016-12-01 01:19:10

Upvotes: 4

Nick Criswell
Nick Criswell

Reputation: 1743

Part 1 - Original Question

The first part of your question can be answered with the sqldf package.

library(sqldf)
df3 <- sqldf("SELECT * FROM df1 a 
             LEFT JOIN df2 b ON a.time < b.time 
             AND a.user = b.user")[,c(1:2, 4)]

#rename to match OP post
names(df3) <- c("user", "time_1", "time_2")

> df3
  user              time_1              time_2
1    1 2016-12-01 08:53:20 2016-12-01 11:50:11
2    1 2016-12-01 12:45:47                <NA>
3    2 2016-12-01 15:34:54                <NA>
4    3 2016-12-01 00:49:50 2016-12-01 01:19:10

Part 2 - Time Window

If you want a window of time to allow for the match, you can subtract seconds within the SQL statement as follows:

df3 <- sqldf("SELECT * FROM df1 a 
             LEFT JOIN df2 b ON a.time < (b.time - 10000)
             AND a.user = b.user")[,c(1:2, 4)]
> df3
  user                time              time.1
1    1 2016-12-01 08:53:20 2016-12-01 11:50:11
2    1 2016-12-01 12:45:47                <NA>
3    2 2016-12-01 15:34:54                <NA>
4    3 2016-12-01 00:49:50                <NA>

Note, whatever you select from b.time will be in seconds.

Upvotes: 5

Related Questions