Reputation: 1036
I have two R tables, each with a list of users and a timestamp corresponding to the time that they took a certain action.
The first of these (df1
) two tables has an exhaustive list of the users, and users will have multiple rows with different timestamps.
The second (df2
) will have a more limited list of users, but again users will be in the table multiple times with different timestamps.
What I'd like to be able to do is join the two tables and end up with a table that matched the user in df1
with the closest timestamp in df2
, as long as the timestamp in df2
happened after the one in df1
.
For example, if I had two tables like:
df1 <- data.frame(c(1,1,2,3), as.POSIXct(c('2016-12-01 08:53:20', '2016-12-01 12:45:47', '2016-12-01 15:34:54', '2016-12-01 00:49:50')))
names(df1) <- c('user', 'time')
df2 <- data.frame(c(1,1,3), as.POSIXct(c('2016-12-01 07:11:01', '2016- 12-01 11:50:11', '2016-12-01 01:19:10')))
names(df2) <- c('user', 'time')
Giving us:
> df1
user time
1 1 2016-12-01 08:53:20
2 1 2016-12-01 12:45:47
3 2 2016-12-01 15:34:54
4 3 2016-12-01 00:49:50
> df2
user time
1 1 2016-12-01 07:11:01
2 1 2016-12-01 11:50:11
3 3 2016-12-01 01:19:10
The output I'd hope to get to would look like:
user time_1 time_2
1 2016-12-01 08:53:20 2016-12-01 11:50:11
1 2016-12-01 12:45:47 NA
2 2016-12-01 15:34:54 NA
3 2016-12-01 00:49:50 2016-12-01 01:19:10
I'm pretty stuck on the way to get this join to work out. As an additional layer of complexity I'd love it if there was a parameter that controlled the time window to allow a match (ie only join the row from df2
if it is within X
minutes of df1
), but really this is secondary to the main problem.
Upvotes: 2
Views: 299
Reputation: 38510
Here is a data.table
solution.
# load data.table and make cast data.frames as data.tables
library(data.table)
setDT(df1)
setDT(df2)
# add time variables, perform join and removing merging time variable
dfDone <- df2[, time2 := time][df1[, time1 := time],
on=.(user, time > time)][, time:= NULL]
dfDone
user time2 time1
1: 1 2016-12-01 11:50:11 2016-12-01 08:53:20
2: 1 <NA> 2016-12-01 12:45:47
3: 2 <NA> 2016-12-01 15:34:54
4: 3 2016-12-01 01:19:10 2016-12-01 00:49:50
If you want to order the columns, you could use setcolorder
setcolorder(dfDone, c("user", "time1", "time2"))
dfDone
user time1 time2
1: 1 2016-12-01 08:53:20 2016-12-01 11:50:11
2: 1 2016-12-01 12:45:47 <NA>
3: 2 2016-12-01 15:34:54 <NA>
4: 3 2016-12-01 00:49:50 2016-12-01 01:19:10
Upvotes: 4
Reputation: 1743
The first part of your question can be answered with the sqldf
package.
library(sqldf)
df3 <- sqldf("SELECT * FROM df1 a
LEFT JOIN df2 b ON a.time < b.time
AND a.user = b.user")[,c(1:2, 4)]
#rename to match OP post
names(df3) <- c("user", "time_1", "time_2")
> df3
user time_1 time_2
1 1 2016-12-01 08:53:20 2016-12-01 11:50:11
2 1 2016-12-01 12:45:47 <NA>
3 2 2016-12-01 15:34:54 <NA>
4 3 2016-12-01 00:49:50 2016-12-01 01:19:10
If you want a window of time to allow for the match, you can subtract seconds within the SQL
statement as follows:
df3 <- sqldf("SELECT * FROM df1 a
LEFT JOIN df2 b ON a.time < (b.time - 10000)
AND a.user = b.user")[,c(1:2, 4)]
> df3
user time time.1
1 1 2016-12-01 08:53:20 2016-12-01 11:50:11
2 1 2016-12-01 12:45:47 <NA>
3 2 2016-12-01 15:34:54 <NA>
4 3 2016-12-01 00:49:50 <NA>
Note, whatever you select from b.time
will be in seconds.
Upvotes: 5