Reputation: 2077
I already asked this question
But now my current problem is slightly different that I couldn't use this solution or figure out . I'd like the data from data set 1 that occurred before dataset 2 and this is my data:
# Dataset 1 (dts1)
UserID date Hour Events
1 5 25/07/2016 02:31 8
2 5 30/07/2016 02:42 6
3 4 23/07/2016 07:52 9
4 14 24/07/2016 03:02 5
5 17 25/07/2016 09:12 10
6 4 22/07/2016 03:22 4
and
# Dataset 2 (dts2)
UserID date Hour transactions
1 5 25/07/2016 02:29 4
2 4 24/07/2016 02:42 2
3 5 25/07/2016 02:52 3
4 6 24/07/2016 03:02 4
5 6 25/07/2016 03:12 1
6 14 26/07/2016 03:22 3
So, I wish to compare those dataset from dataset 1 and only add those happened before dataset 2. In other words, I want to make sure that I do not count those events that happened after the last transaction of a user. The Ideal output is as follows:
#output
UserID Events transaction
5 8 4,3
4 9,4 2
14 5 3
17 10 NA
In the above example, I made sure that I removed event 6 for user 5 because it happened after his last transaction.
Upvotes: 0
Views: 86
Reputation: 7455
This is a modification of @dimitris_ps answer for your previous question. If he chooses to answer, I will gladly delete mine.
The main difference between this problem and your previous problem is that we now want all dts1
events that are before the last dts2
transaction for each specific UserID
. Therefore, we want to group_by
the UserID
first and then filter
for only those rows where the dts1
event times are less than the last dts2
transaction time. Then we can summarise
both unique Events
and transactions
, still grouped by UserID
.
The code is:
library(dplyr)
## I will not use the lubridate package, instead I will convert the time
## using as.POSIXct
dts1$time <- as.POSIXct(paste(dts1$date, dts1$Hour), format="%d/%m/%Y %H:%M")
dts2$time <- as.POSIXct(paste(dts2$date, dts2$Hour), format="%d/%m/%Y %H:%M")
# first join the two data.frames by UserID.
result <- left_join(dts1, dts2, by="UserID") %>%
# all subsequent processing is grouped by the UserID because we
# want to compare the last transaction time to the Event times
# for each UserID.
group_by(UserID) %>%
# apply the filtering condition dts1 Event must be before last dts2 transaction.
# Note that we keep rows for which there is no row in
# dts2 for a UserID in dts1. This is the case for UserID=17.
filter(is.na(time.y) | last(time.y) > time.x) %>%
# summarise Events and transactions
summarise(Events = toString(unique(Events)), transactions = toString(unique(transactions)))
The results are:
print(result)
## A tibble: 4 x 3
## UserID Events transactions
## <int> <chr> <chr>
##1 4 9, 4 2
##2 5 8 4, 3
##3 14 5 3
##4 17 10 NA
Hope this helps.
Upvotes: 2
Reputation: 13138
We first convert the time to POSIX class.
dts1$time <- strptime(paste(dts1$date, dts1$Hour), format="%d/%m/%Y %H:%M")
dts2$time <- strptime(paste(dts2$date, dts2$Hour), format="%d/%m/%Y %H:%M")
Next step is to make the transactions
column. We do so by first sorting dts2
by time (from latest to earliest), then using the by()
function to subset dts
by UserID
and taking the first row from each subset. We then use tapply
to retrieve transactions
for each UserID
.
dts2 <- dts2[order(dts2$time, decreasing=TRUE), ]
out <- do.call(rbind, by(dts2[,c("UserID","time")], dts2$UserID, head, 1))
out$transactions <- tapply(dts2$transactions, dts2$UserID, c)
Finally, we build the Events
column using the rule you described.
out$Events <- sapply(1:nrow(out), function(i) {
User2 <- out$UserID[i]
time2 <- out$time[i]
rows <- which(dts1$UserID==User2 & dts1$time<time2)
if (length(rows)>0) {
dts1$Events[rows]
} else {
NA
}
})
Result:
> out
UserID time transactions Events
4 4 2016-07-24 02:42:00 2 9, 4
5 5 2016-07-25 02:52:00 3, 4 8
6 6 2016-07-25 03:12:00 1, 4 NA
14 14 2016-07-26 03:22:00 3 5
Note that since User 17 is not in dts2
, it doesn't appear in out
.
Upvotes: 2