MFR
MFR

Reputation: 2077

Map two data frames in a certain condition

I already asked this question

But now my current problem is slightly different that I couldn't use this solution or figure out . I'd like the data from data set 1 that occurred before dataset 2 and this is my data:

 # Dataset 1 (dts1)

     UserID   date   Hour     Events    
  1    5  25/07/2016  02:31      8         
  2    5  30/07/2016  02:42      6      
  3    4  23/07/2016  07:52      9         
  4   14  24/07/2016  03:02      5         
  5   17  25/07/2016  09:12      10        
  6    4  22/07/2016  03:22      4  

and

   # Dataset 2 (dts2)

    UserID       date   Hour      transactions     
 1     5    25/07/2016  02:29      4         
 2     4    24/07/2016  02:42      2       
 3     5    25/07/2016  02:52      3         
 4     6    24/07/2016  03:02      4         
 5     6    25/07/2016  03:12      1        
 6    14    26/07/2016  03:22      3  

So, I wish to compare those dataset from dataset 1 and only add those happened before dataset 2. In other words, I want to make sure that I do not count those events that happened after the last transaction of a user. The Ideal output is as follows:

    #output 

   UserID   Events      transaction 

    5         8         4,3
    4         9,4       2
   14         5         3
   17         10        NA

In the above example, I made sure that I removed event 6 for user 5 because it happened after his last transaction.

Upvotes: 0

Views: 86

Answers (2)

aichao
aichao

Reputation: 7455

This is a modification of @dimitris_ps answer for your previous question. If he chooses to answer, I will gladly delete mine.

The main difference between this problem and your previous problem is that we now want all dts1 events that are before the last dts2 transaction for each specific UserID. Therefore, we want to group_by the UserID first and then filter for only those rows where the dts1 event times are less than the last dts2 transaction time. Then we can summarise both unique Events and transactions, still grouped by UserID.

The code is:

library(dplyr)

## I will not use the lubridate package, instead I will convert the time
## using as.POSIXct
dts1$time <- as.POSIXct(paste(dts1$date, dts1$Hour), format="%d/%m/%Y %H:%M")
dts2$time <- as.POSIXct(paste(dts2$date, dts2$Hour), format="%d/%m/%Y %H:%M")

# first join the two data.frames by UserID.
result <- left_join(dts1, dts2, by="UserID") %>%

# all subsequent processing is grouped by the UserID because we 
# want to compare the last transaction time to the Event times
# for each UserID.
group_by(UserID) %>%

# apply the filtering condition dts1 Event must be before last dts2 transaction.
# Note that we keep rows for which there is no row in  
# dts2 for a UserID in dts1. This is the case for UserID=17.
filter(is.na(time.y) | last(time.y) > time.x) %>% 

# summarise Events and transactions
summarise(Events = toString(unique(Events)), transactions = toString(unique(transactions)))

The results are:

print(result)
## A tibble: 4 x 3
##  UserID Events transactions
##   <int>  <chr>        <chr>
##1      4   9, 4            2
##2      5      8         4, 3
##3     14      5            3
##4     17     10           NA

Hope this helps.

Upvotes: 2

Weihuang Wong
Weihuang Wong

Reputation: 13138

We first convert the time to POSIX class.

dts1$time <- strptime(paste(dts1$date, dts1$Hour), format="%d/%m/%Y %H:%M")
dts2$time <- strptime(paste(dts2$date, dts2$Hour), format="%d/%m/%Y %H:%M")

Next step is to make the transactions column. We do so by first sorting dts2 by time (from latest to earliest), then using the by() function to subset dts by UserID and taking the first row from each subset. We then use tapply to retrieve transactions for each UserID.

dts2 <- dts2[order(dts2$time, decreasing=TRUE), ]
out <- do.call(rbind, by(dts2[,c("UserID","time")], dts2$UserID, head, 1))
out$transactions <- tapply(dts2$transactions, dts2$UserID, c)

Finally, we build the Events column using the rule you described.

out$Events <- sapply(1:nrow(out), function(i) {
    User2 <- out$UserID[i]
    time2 <- out$time[i]
    rows <- which(dts1$UserID==User2 & dts1$time<time2)
    if (length(rows)>0) {            
        dts1$Events[rows]
    } else {
        NA
    }
})

Result:

> out
   UserID                time transactions Events
4       4 2016-07-24 02:42:00            2   9, 4
5       5 2016-07-25 02:52:00         3, 4      8
6       6 2016-07-25 03:12:00         1, 4     NA
14     14 2016-07-26 03:22:00            3      5

Note that since User 17 is not in dts2, it doesn't appear in out.

Upvotes: 2

Related Questions