Reputation: 2053
I have two tables: Display
and Review
. The Review
table contains information on reviews on products of an online store. Each row represents the date of the review as well as the cumulative number of reviews and the average rating for the product up to the date.
page_id<-c("1072659", "1072659" , "1072659","1072650","1072660","1072660")
review_id<-c("1761023","1761028","1762361","1918387","1761427","1863914")
date<-as.Date(c("2013-07-11","2013-08-12","2014-07-15","2014-09-10","2013-07-27","2014-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(1,2,3,1,1,2)
average_rating<-c(5,3.5,4,3,5,5)
Review<-data.frame(page_id,review_id,date,cumulative_No_reviews,average_rating)
page_id review_id date cumulative_No_reviews average_rating
1072659 1761023 2013-07-11 1 5
1072659 1761028 2013-08-12 2 3.5
1072659 1762361 2014-07-15 3 4
1072650 1918387 2014-09-10 1 3
1072660 1761427 2013-07-27 1 5
1072660 1863914 2014-08-12 2 5
The Display
table captures the data on customers’ visit to product pages.
page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
Display<-data.frame(page_id,date)
page_id date
1072659 2013-07-10
1072659 2013-08-03
1072659 2015-02-11
1072650 2014-08-10
1072650 2014-09-09
1072660 2013-08-12
1072660 2014-09-12
1072660 2015-08-12
I’d like to add two column to the Display
table (call it Display2
) in a way that it reflects the latest review information up the point of visit for each product, as follows:
page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(0,1,3,0,0,1,2,2)
average_rating<-c(NA,5,4,NA,NA,5,5,5)
Display2<-data.frame(page_id,date,cumulative_No_reviews,average_rating)
page_id date cumulative_No_reviews average_rating
1072659 2013-07-10 0 NA
1072659 2013-08-03 1 5
1072659 2015-02-11 3 4
1072650 2014-08-10 0 NA
1072650 2014-09-09 0 NA
1072660 2013-08-14 1 5
1072660 2014-09-11 2 5
1072660 2015-08-12 2 5
I would appreciate your help with this.
Upvotes: 0
Views: 34
Reputation: 28695
You can do this with a data.table
join. You can join the Review
table with the Display
table on the condition that the page_id
s match and the Review
date is less than the Display
date. For some rows of Display
there will be multiple rows of Review
which match according to these conditions, so with mult = 'last'
we're just picking the last one. Since Review
is sorted by date, this means the one with the most recent date.
library(data.table) # 1.12.6 for nafill (used below)
setDT(Display)
setDT(Review)
Display2 <- Review[Display, on = .(page_id, date < date), mult = 'last']
Display2
# page_id review_id date cumulative_No_reviews average_rating
# 1: 1072659 <NA> 2013-07-10 NA NA
# 2: 1072659 1761023 2013-08-03 1 5
# 3: 1072659 1762361 2015-02-11 3 4
# 4: 1072650 <NA> 2014-08-10 NA NA
# 5: 1072650 <NA> 2014-09-09 NA NA
# 6: 1072660 1761427 2013-08-12 1 5
# 7: 1072660 1863914 2014-09-12 2 5
# 8: 1072660 1863914 2015-08-12 2 5
Now this output almost matches what you show in the question, we just need to remove the review_id
column and replace NA
s in the cumulative_No_reviews
column with 0
s.
Display2[, review_id := NULL]
Display2[, cumulative_No_reviews := nafill(cumulative_No_reviews, fill = 0)][]
# page_id date cumulative_No_reviews average_rating
# 1: 1072659 2013-07-10 0 NA
# 2: 1072659 2013-08-03 1 5
# 3: 1072659 2015-02-11 3 4
# 4: 1072650 2014-08-10 0 NA
# 5: 1072650 2014-09-09 0 NA
# 6: 1072660 2013-08-12 1 5
# 7: 1072660 2014-09-12 2 5
# 8: 1072660 2015-08-12 2 5
Upvotes: 3