AliCivil
AliCivil

Reputation: 2053

How to update a data.frame based on information from another data.frame

I have two tables: Display and Review. The Review table contains information on reviews on products of an online store. Each row represents the date of the review as well as the cumulative number of reviews and the average rating for the product up to the date.

page_id<-c("1072659", "1072659" , "1072659","1072650","1072660","1072660")  
review_id<-c("1761023","1761028","1762361","1918387","1761427","1863914")
date<-as.Date(c("2013-07-11","2013-08-12","2014-07-15","2014-09-10","2013-07-27","2014-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(1,2,3,1,1,2)
average_rating<-c(5,3.5,4,3,5,5)
Review<-data.frame(page_id,review_id,date,cumulative_No_reviews,average_rating)
page_id        review_id          date    cumulative_No_reviews   average_rating
1072659          1761023        2013-07-11      1                       5
1072659          1761028        2013-08-12      2                       3.5
1072659          1762361        2014-07-15      3                       4
1072650          1918387        2014-09-10      1                       3
1072660          1761427        2013-07-27      1                       5
1072660          1863914        2014-08-12      2                       5

The Display table captures the data on customers’ visit to product pages.

page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
Display<-data.frame(page_id,date)
page_id         date        
1072659     2013-07-10      
1072659     2013-08-03      
1072659     2015-02-11      
1072650     2014-08-10  
1072650     2014-09-09      
1072660     2013-08-12      
1072660     2014-09-12      
1072660     2015-08-12      

I’d like to add two column to the Display table (call it Display2) in a way that it reflects the latest review information up the point of visit for each product, as follows:

page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(0,1,3,0,0,1,2,2)
average_rating<-c(NA,5,4,NA,NA,5,5,5)
Display2<-data.frame(page_id,date,cumulative_No_reviews,average_rating)
 page_id            date        cumulative_No_reviews   average_rating
 1072659        2013-07-10                 0                NA
 1072659        2013-08-03                 1                5
 1072659        2015-02-11                 3                4
 1072650        2014-08-10                 0                NA
 1072650        2014-09-09                 0                NA
 1072660        2013-08-14                 1                5
 1072660        2014-09-11                 2                5
 1072660        2015-08-12                 2                5

I would appreciate your help with this.

Upvotes: 0

Views: 34

Answers (1)

IceCreamToucan
IceCreamToucan

Reputation: 28695

You can do this with a data.table join. You can join the Review table with the Display table on the condition that the page_ids match and the Review date is less than the Display date. For some rows of Display there will be multiple rows of Review which match according to these conditions, so with mult = 'last' we're just picking the last one. Since Review is sorted by date, this means the one with the most recent date.

library(data.table) # 1.12.6 for nafill (used below)
setDT(Display)
setDT(Review)

Display2 <- Review[Display, on = .(page_id, date < date), mult = 'last']
Display2
#    page_id review_id       date cumulative_No_reviews average_rating
# 1: 1072659      <NA> 2013-07-10                    NA             NA
# 2: 1072659   1761023 2013-08-03                     1              5
# 3: 1072659   1762361 2015-02-11                     3              4
# 4: 1072650      <NA> 2014-08-10                    NA             NA
# 5: 1072650      <NA> 2014-09-09                    NA             NA
# 6: 1072660   1761427 2013-08-12                     1              5
# 7: 1072660   1863914 2014-09-12                     2              5
# 8: 1072660   1863914 2015-08-12                     2              5

Now this output almost matches what you show in the question, we just need to remove the review_id column and replace NAs in the cumulative_No_reviews column with 0s.

Display2[, review_id := NULL]
Display2[, cumulative_No_reviews := nafill(cumulative_No_reviews, fill = 0)][]
#    page_id       date cumulative_No_reviews average_rating
# 1: 1072659 2013-07-10                     0             NA
# 2: 1072659 2013-08-03                     1              5
# 3: 1072659 2015-02-11                     3              4
# 4: 1072650 2014-08-10                     0             NA
# 5: 1072650 2014-09-09                     0             NA
# 6: 1072660 2013-08-12                     1              5
# 7: 1072660 2014-09-12                     2              5
# 8: 1072660 2015-08-12                     2              5

Upvotes: 3

Related Questions